Visualizing differential geometry in Jupyter notebooks

(you can download this post as a notebook here)

Visualizing differential geometry in Jupyter notebooks

I taught a senior seminar on differential geometry last year. I'll be honest: it was a selfish course. Markus Deserno writes all of these cool papers about membranes, and I didn't know enough math to follow them. So, the goal of our course was basically to learn enough differential geometry to read several of his papers. It was fantastic. We used several of his papers, as well as Kreyszig's textbook, as our core materials.

Here's the thing: a lot of this was pretty foreign to my physics students. In particular, the discussion of surfaces and mappings was new. So, we wrote some tools in Jupyter Notebooks to help us visualize and solve problems. I particularly like the stuff we wrote to visualize a mapping, and I don't know of a comparable resource elsewhere. Let's jump in.

The punchline: how do you map a surface to a curve?

As a teaser, here's how $(u_1,u_2) \to (u_1+u_2, (u_1+u_2)^2, (u_1+u_2)^3)$ maps a surface to a curve. As you play the animation, the green surface slowly interpolates from the $(u_1, u_2)$ patch to the curve, following the red arrows. Watching this work for several different mappings cleared a lot up for my class.

In [18]:
 
Out[18]:


Once Loop Reflect

OK. Cool. Now let's talk differential geometry for a bit.

Surfaces

We get our introduction to surfaces in Kreyszig Ch. 3. Specifically, $\S3.24$ In most of our classes, we've defined surfaces like

$$z = f(x,y) $$

and we've been good at visualizing them. Here, we run into the definition of a surface in terms of a mapping that sends

$$ (u_1,u_2) \to (x(u_1,u_2), y(u_1,u_2), z(u_1,u_2)) $$

and Kreyszig warns us that something like

$$ (u_1+u_2, (u_1+u_2)^2, (u_1+u_2)^3) $$

is problematic because, since it can be parameterized by a single variable, it's actually a curve rather than a surface. We also run into the Jacobian and some linear algebra. Our goal right now is to work with concrete curves and surfaces in $\mathbb{R}^3$ So, let's figure out how to visualize these sorts of mappings. First, some gunk to set up our python environment. For the experts, note that we could use %matplotlib notebook to make things prettier, but it seems really slow.

In [1]:
import numpy as np, scipy as sp, pandas as pd, seaborn as sns, matplotlib.pyplot as plt, matplotlib as mpl
import itertools
from mpl_toolkits.mplot3d import Axes3D, proj3d
import matplotlib.animation as animation
from matplotlib.patches import FancyArrowPatch
from IPython.display import HTML
from ipywidgets import interact, interactive, fixed

In class, we used interact to visualize things. However, I'm making this available on the web using the fact that Jake Vanderplas' awesome JSAnimation package is built in to matplotlib 2.1. I still strongly recommend interact for playing around on your own server.

In [ ]:
# mpl.rcParams['animation.html'] = 'jshtml'
# Doesn't seem to work for me, so I make the HTML explicitly below

Now, let's just try one of these out. One of the examples we want to look at in the book is

$$ (u_1+u_2, \sin(u_1), \cos(u_1+u_2)) $$

so let's make a grid of points where $u_1$ and $u_2$ vary from -2 to 2, then plot the corresponding surface. We'll plot $(u_1, u_2, 0)$ in grey and the surface in blue.

In [2]:
u1 = np.linspace(-2,2,100)
u2 = np.linspace(-2,2,100)
U1, U2 = np.meshgrid(u1, u2)
X = U1 + U2
Y = np.sin(U1)
Z = np.cos(U1+U2)

ax = plt.gca(projection='3d')
surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
surf = ax.scatter(X, Y, Z)
plt.show()

You can play around with the above, using more or fewer points, etc. You can clearly see that the surface extends vertically and horizontally beyond the patch from $u_1$ and $u_2$, although it's kind of cheating to plot them both at the same time. So let's wrap that up in a function and try it with different maps.

To make it cuter, we'll draw red arrows from some specific points to see which points get mapped where. And we'll plot a surface not a collection of points, by default, because it's way faster. Most impoartantly, we'll wrap all of the plotting stuff up in a function of its own so that we just need to define a map, and then ask to plot it.

In [3]:
class Arrow3D(FancyArrowPatch):
    #http://stackoverflow.com/questions/22867620/putting-arrowheads-on-vectors-in-matplotlibs-3d-plot
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        FancyArrowPatch.draw(self, renderer)

We played around with things quite a bit in class, so the function below has options for changing from surface to points ... in this notebook, I just stick with the surface, but you should play interactively!

In [4]:
def plot_surface(f,plotmode='surface',urange=2*np.pi):
    u1 = np.linspace(-urange,urange,100)
    u2 = np.linspace(-urange,urange,100)
    ax = plt.gca(projection='3d')
    U1, U2 = np.meshgrid(u1, u2)
    X,Y,Z = f(U1,U2)
    surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
    if plotmode == 'surface':
        surf = ax.plot_surface(X,Y,Z)
    elif plotmode == 'points':
        surf = ax.scatter(X,Y,Z)
    else:
        raise Exception('Unknown plot mode {p}'.format(p=plotmode))
    for (w1,w2) in itertools.permutations([-urange,-urange/2,0,urange/2,urange],2):
        x,y,z = f(w1,w2)
        a = Arrow3D([w1,x],[w2,y],[0,z], mutation_scale=20, 
                    lw=1, arrowstyle="-|>", color="r",
                   alpha=0.7,zorder=-1)
        ax.add_artist(a)
    plt.show()

And now let's plot that same surface $(u_1,u_2) \to u_1 + u_2, \sin(u_1), \cos(u_1+u_2)$

In [5]:
def f(X,Y):
    return X+Y, np.sin(X), np.cos(X+Y)
plot_surface(f)

Changing the $(u1,u2)$ patch

We spent a bit of time thinking about what exactly $u1$ and $u2$ were doing. If you make their range bigger or smaller, you change the size of the surface. That's built into the function defined above. By default, they each go from -urange to urange. As you can see in the figure above, urange defaults to $2\pi$. We can look at a smaller patch to see what's going on:

In [6]:
plot_surface(f,urange=2)

Playing around

At this point, the class just played around with the results. Everybody defined their own functions, and looked at the resulting maps. You can do that too if you run the notebook interactively! Here's one that the students particularly liked, despite the attempted bad math.

In [7]:
def f(X,Y):
    return X,np.log(Y),3*X+5*Y
plot_surface(f)
/Users/mglerner/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in log
  
/Users/mglerner/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log
  
/Users/mglerner/anaconda3/lib/python3.6/site-packages/mpl_toolkits/mplot3d/proj3d.py:141: RuntimeWarning: invalid value encountered in true_divide
  txs, tys, tzs = vecw[0]/w, vecw[1]/w, vecw[2]/w

The curve

OK. A large portion of why we did this was because it was really not obvious to the class why $(u_1+u_2, (u_1+u_2)^2, (u_1+u_2)^3)$ mapped to a curve. We could all follow the mathematical arguments, but it didn't really stick without a good visualization. So, let's look:

so, what does the curve look like?

In [8]:
def f(X,Y):
    return X+Y, (X+Y)**2, (X+Y)**3
plot_surface(f, plotmode='points')

That's a great start. We can see where representative points match, and it's definitely useful to visualize things and "prove" to ourselves that you do indeed get a curve. It's much more convincing when we make an interactive version (below). Before we do that, let's also quickly check out a cylinder.

In [9]:
def f(X,Y):
    return 2*np.cos(X),2*np.sin(X),Y
plot_surface(f)

In class, we played around with urange and the scale factors to make sure we understood what was going on.

An interactive one

That was OK, and it convinced us that the maps really did go from, e.g., a plane to a cylinder. But it still wasn't quite so obvious exactly how/why it worked.

Then I had the idea to make it interactive: you get a slider so you can watch the green surface interpolate linearly between the grey patch $(u_1,u_2)$ and the surface defined by $(x,y,z)$. This is worth running interactively, and playing with, as it really cleared things up. If you run it locally and interactively, you get sliders that let you rotate the surface around in 3D, so you can look at it from all angles. If you're runnign it locally, you wnat the interp step to be more like 0.01 and the degree step to be more like 1. For notebooks to be viewable online, I'm using a coarser step on interp, and I'm skipping the degree steps altogether.

In [10]:
def plot_surface_interpolate(f=f,plotmode='surface',urange=2*np.pi,interp=0.1,elev=None,azim=None):
    u1 = np.linspace(-urange,urange,100)
    u2 = np.linspace(-urange,urange,100)
    #fig = plt.figure()
    ax = plt.gca(projection='3d')
    U1, U2 = np.meshgrid(u1, u2)
    X,Y,Z = f(U1,U2)
    U3 = np.zeros_like(Z)
    A = (1-interp)*U1 + interp*X
    B = (1-interp)*U2 + interp*Y
    C = (1-interp)*U3 + interp*Z
    surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
    if plotmode == 'surface':
        surf = ax.plot_surface(X,Y,Z,alpha=0.9,color='blue')
        surf3 = ax.plot_surface(A,B,C,color='green',zorder=-2,alpha=0.9)
    elif plotmode == 'points':
        surf = ax.scatter(X,Y,Z,alpha=0.9,color='blue')
        surf3 = ax.scatter(A,B,C,color='green',zorder=-2,alpha=0.9)
    else:
        raise Exception('Unknown plot mode {p}'.format(p=plotmode))
    
    for (w1,w2) in itertools.permutations([-urange,-urange/2,0,urange/2,urange],2):
        x,y,z = f(w1,w2)
        a = Arrow3D([w1,x],[w2,y],[0,z], mutation_scale=20, 
                    lw=1, arrowstyle="-|>", color="r",
                   alpha=0.7,zorder=-1)
        ax.add_artist(a)
    ax.view_init(elev=elev,azim=azim)
    plt.show()
In [11]:
# You can change this to %matplotlib notebook if you want. It looks better, but is much slower for me.
#%matplotlib inline

Now play with the interp slider, to see how you'd interpolate linearly from $(u_1, u_2)$ to $(x,y,z)$

We just looked at mapping the plane to a cylinder, so we'll start with it. I made my students describe what they thought would happen first, which was a good idea, because the real mapping is actually a bit subtle.

In [12]:
interp_step = 0.01
degree_step=1
In [13]:
def plot_this(interp=0.1,azim=30,elev=30,plotmode='surface'):
    def f(X,Y):
        return 2*np.cos(X),2*np.sin(X),Y
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode)
interact(plot_this,interp=(0,1,interp_step),
         azim=(0,360,degree_step),
         elev=(0,360,degree_step),
         plotmode=['points','surface']);

Or, if you're looking at the blog post and want an animation where the slider controls the interpolation ...

First, the boilerplate code

In [14]:
def update_interp(i,f,surfs,elev=30,azim=30,urange=2*np.pi):

    surf,surf2,surf3 = surfs
    ax.clear()
    interp = i/10
    A = (1-interp)*U1 + interp*X
    B = (1-interp)*U2 + interp*Y
    C = (1-interp)*U3 + interp*Z

    surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
    surf = ax.plot_surface(X,Y,Z,alpha=0.5,color='blue')
    surf3 = ax.plot_surface(A,B,C,color='green',zorder=-2,alpha=0.9)

    for (w1,w2) in itertools.permutations([-urange,-urange/2,0,urange/2,urange],2):
        x,y,z = f(w1,w2)
        a = Arrow3D([w1,x],[w2,y],[0,z], mutation_scale=20, 
                    lw=1, arrowstyle="-|>", color="r",
                   alpha=0.7,zorder=-1)
        ax.add_artist(a)
    ax.view_init(elev=elev,azim=azim)
    return (surf,surf2,surf3)
urange=2*np.pi
u1 = np.linspace(-urange,urange,100)
u2 = np.linspace(-urange,urange,100)
U1, U2 = np.meshgrid(u1, u2)
U3 = np.zeros_like(U1)

Then the code for the cylinder mapping. Play with the slider to change the interploation! Step through frame-by-frame, or play the animation. Use the +/- controls to speed up/slow down the animation!

In [15]:
def f(X,Y):
    return 2*np.cos(X),2*np.sin(X),Y

X,Y,Z = f(U1,U2)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')


surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
surf = ax.plot_surface(X,Y,Z,alpha=0.9,color='blue')
surf3 = ax.plot_surface(X,Y,Z,color='green',zorder=-2,alpha=0.9)
surfs = (surf,surf2,surf3)

ani = animation.FuncAnimation(fig, update_interp, 10, fargs=(f,surfs),
                             interval=50, blit=False)

HTML(ani.to_jshtml())
Out[15]:


Once Loop Reflect

Can you see how it curls up in the middle? If not, maybe a different angle helps.

In [16]:
def f(X,Y):
    return 2*np.cos(X),2*np.sin(X),Y

elev, azim = 66,30
X,Y,Z = f(U1,U2)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')


surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
surf = ax.plot_surface(X,Y,Z,alpha=0.9,color='blue')
surf3 = ax.plot_surface(X,Y,Z,color='green',zorder=-2,alpha=0.9)
surfs = (surf,surf2,surf3)

ani = animation.FuncAnimation(fig, update_interp, 10, fargs=(f,surfs,elev,azim),
                             interval=50, blit=False)

HTML(ani.to_jshtml())
Out[16]:


Once Loop Reflect

This next one was, for us, particularly compelling. Seeing how you map a 2D surface into a curve was enlightening

And here we go, the main reason for this notebook:

In [ ]:
def plot_this(interp=0.1,azim=30,elev=30,plotmode='surface'):
    def f(X,Y):
        return X+Y, (X+Y)**2, (X+Y)**3
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=1)
interact(plot_this,interp=(0,1,interp_step),azim=(0,360,degree_step),elev=(0,360,degree_step),plotmode=['points','surface'])
In [17]:
def f(X,Y):
    return X+Y, (X+Y)**2, (X+Y)**3

elev, azim = 30,30
urange=1
u1 = np.linspace(-urange,urange,100)
u2 = np.linspace(-urange,urange,100)
U1, U2 = np.meshgrid(u1, u2)
U3 = np.zeros_like(U1)

X,Y,Z = f(U1,U2)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

#ax = p3.Axes3D(fig)

surf2 = ax.plot_surface(U1, U2, np.zeros_like(U1), color='grey', alpha=0.5)
surf = ax.plot_surface(X,Y,Z,alpha=0.9,color='blue')
surf3 = ax.plot_surface(X,Y,Z,color='green',zorder=-2,alpha=0.9)
surfs = (surf,surf2,surf3)

ani = animation.FuncAnimation(fig, update_interp, 10, fargs=(f,surfs,elev,azim,urange),
                             interval=50, blit=False)

HTML(ani.to_jshtml())
Out[17]:


Once Loop Reflect

More examples

After that, we did a bunch of examples from the book, and from our own imagination. I don't want this notebook to become too big, so I'm not going to make JSAnimations for all of them. Locally, I run them with interact ... I've commented out the interact line below and just put in a single plot line, but you should (have I said this yet?) download the notebook and play with it interactively!

In [30]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(X,Y):
        return X+Y, np.sin(X), np.cos(X+Y)
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,interp_step),azim=(0,360,degree_step),elev=(0,360,degree_step),plotmode=['points','surface']);
plot_this()

Example 1 (two representations of the plane)

rectangular and polar. Note what happens when you drag the interp slider on the polar representation.

In [31]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(U1,U2):
        return U1, U2, np.zeros_like(U2)
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()

This one trips the class up, because you want to say "nothing is going to change" ... but it does ...

In [32]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(U1,U2):
        return U1*np.cos(U2), U1*np.sin(U2), np.zeros_like(U2)
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()

Example 3 (sphere of radius r) (transparency is kind of problematic here)

In [34]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface',r=1):
    def f(U1,U2,r=r):
        return r*np.cos(U2)*np.cos(U1), r*np.cos(U2)*np.sin(U1), r*np.sin(U2)
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2*np.pi)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface'],
#        r=(0,10,0.2));
plot_this()   

Example 3

In [35]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface',a=1):
    def f(U1,U2,a=a):
        return U1*np.cos(U2), U1*np.sin(U2), a*U1
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface'],
#        a=(0,10,0.2));
plot_this()

Problem 25.1

In [36]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(U1,U2):
        return 0, U1, U2
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()
In [37]:
def plot_this(interp=0.1,azim=53,elev=30,plotmode='surface'):
    def f(U1,U2):
        return U1+U2, U1+U2, U1
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()
In [38]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(U1,U2):
        return np.cos(U1), np.sin(U1), U2
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()
In [39]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def s(U1,U2):
        return U1+U2
    def h1(s):
        return s
    def h2(s):
        return s**2
    def h3(s):
        return s
    
    def f(U1,U2):
        S = s(U1,U2)
        return h1(S), h2(S), h3(S)
    def f(U1,U2):
        return h1(U1), h2(U1), h3(U1) + U2
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()

Monge gauge

Now we should do a specific example in the Monge gauge

In [40]:
def plot_this(interp=0.1,azim=113,elev=30,plotmode='surface'):
    def f(U1,U2):
        X = U1
        Y = U2
        def h(x,y):
            return 4+ np.cos(x) + np.sin(y)
        return X,Y,h(X,Y)
    plot_surface_interpolate(f,interp=interp,azim=azim,elev=elev,plotmode=plotmode,urange=2)
#interact(plot_this,interp=(0,1,0.01),azim=(0,360,1),elev=(0,360,1),plotmode=['points','surface']);
plot_this()

... and that's it. The whole thing was a ton of fun, and I'm surprised I haven't seen this sort of interpolation easily available before. Let me know if I've missed something!

Switching to Nikloa for Jupyter Notebooks and a static site

I've been using WordPress for quite a while, almost entirely because it's an out-of-the-box blog setup that just works. But it kind of sucks for what I mostly want to do, which is stick some code into blog posts. In fact, what usually happens is that I do something in a Jupyter notebook, and want to stick it up as a blog post. That's a real pain in WordPress. The best I found was converting the notebooks to html and then including them as a static block, but those invariably are brittle and ugly.

So, smart people like Fernando Perez and Michelle Gill switched over to something that deals natively with Jupyter notebooks a long time ago (so long ago they were called IPython Notebooks!). I'm a slow pony, but I'm switching to Nikola. It seems to be the easiest one at the moment. It's a static page generator, which is more than fine for my purposes, and it deals natively with Jupyter notebooks. Sweet. I thought it would be useful to document the process for future-me. I leaned heavily on the Nikola site (including the documentation for import_wordpress). The process wasn't completely trivial, but that's because I did some hacky stuff to get Jupyter Notebooks included in my WordPress posts anyway. This seems like a lot of work for like 13 posts, but nobody ever claimed I was wise.

Grabbing the site

Log in as an administrator, go to the Tools menu, click on Export, then download an Export File. Nice and easy. Thanks, WordPress!

Install Nikola

Note: pip doesn't actually play nicely with the conda environments. It looks like it installs nikola's stuff on top of everything else (e.g. the Nikola tools are available even when I'm not in the blog environment from below). That's not great. I probably should use a virtualenv, but I don't have time to learn what one of those is today.

I'm using Anaconda for my Python, and Nikola installs via pip, so I make a new environment just for the blog. Probably good practice anyway. Speaking of which, this is all Python 3. There's some weird bug involved in having pip upgrade setuptools,

conda create --name blog
source activate blog
conda install setuptools
pip install --upgrade "Nikola[extras]"
mkdir blog
cd blog
mkdir posts
mkdir output

Now import the wordpress site. If I just say

nikola import_wordpress ~/Downloads/biophysicsandbeer.wordpress.2017-10-10.xml

I don't actually get some of my content imported. In particular, the quoted Python code doesn't show up. So,

nikola import_wordpress --download-auth=[my secret username]:"[my secret password]" --install-wordpress-compiler --transform-to-html ~/Downloads/biophysicsandbeer.wordpress.2017-10-10.xml

Tells it to be explicit about converting to html and downloading everything it can. It also produces some errors, specifically a bunch of lines like

[2017-10-11T19:49:10Z] WARNING: Nikola: Can't do a redirect for: 'http://www.mglerner.com/blog/?p=8/'

which are telling us that it can't just spit out a new file in place of the old one because the question marks in the URL mean request data. We'll deal with this in a minute.

We then fire up the site with

cd new_site
nikola build
nikola serve --browser

And it almost all works. Here are the missing things:

My pre-rendered Jupyter notebooks suck.

I used the pageview WordPress plugin (short code? I don't know) to include pre-rendered Jupyter notebooks. That just breaks upon import. But the Nikloa posts are just straight HTML. I can just include the stuff directly. That was pretty easy. In my future free time (ha!), I might go back and convert some of them directly to Jupyter notebooks.

Comments are all missing.

  1. For future stuff to work, I'll get a Disqus ID and start using it. (biophysics-and-beer ... check).

  2. Add all of the old comments by hand. This isn't taking a huge amount of time, because I don't have a huge amount of comments, and I'm not going crazy with matching formatting. I'm just copying the comments div from the old blog. If you're doing this to your own blog, don't forget to get rid of the "response" part of the comments div. If I had it to do over, I'd write a script that did that for me, or I'd learn how to import comments into Disqus, but it didn't take much time this way. On the plus side, I'm ditching some (not all) spam I had missed.

TODO

  • The twitter follow plugin doesn't work. I think that should be straightforward, but I'll fix it later.
  • Get a graphical header back. I have all of the images downloaded, so I just need to figure out how the css/templates work. Maybe make my own theme. But I tried using one of the built-in Nikola themes (srcco.de) and I ended up running into more trouble than I wanted to deal with. This will take an hour or two later.

Back to the redirects

Those redirects are kind of a pain. That's Nikola saying it can't just spit out a new file in place of the old file because the question marks mean request data. Thanks, wordpress! I wanted simple stuff like an .htaccess file saying

Redirect /blog/page8 /blog/new-and-fancy.html

but no ... Now I need to look at a StackOverflow answer and remember why I stopped being a web developer years ago. Here's my new .htaccess file

RewriteEngine on
RewriteBase /

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /blog/\?p=5/ [NC]
RewriteRule ^ /blog/posts/the-blogging-begins.html [L,R=301]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /blog/\?p=5 [NC]
RewriteRule ^ /blog/posts/the-blogging-begins.html [L,R=301]

to handle people accessing each page both with and without a trailing slash. Now, where should I point the redirects?

Static site means why host it myself?

It made sense to host my own WordPress site. But I'm not sure I need to host my own static site. In fact, life would probably be easier if I just used github pages. Cool. Github has instructions that basically boil down to "set up a repo called mglerner.github.io and do some special things" ... and Nikola means I don't have to actually know what those special things are! After initializing my Nikola blog directory as a github repo and doing

git remote add origin git@github.com:mglerner/mglerner.github.io.git

I can make Nikola rebuild things like github wants, add, commit and push everything automatically with

nikola github_deploy

So, after having done that, I just make the .htaccess file listed above point everything from my old site to my fancy new github site.

... and it's up and running!

Making it possible to post with Jupyter notebooks

This is documented in a few places, but sometimes things change, so you're probably better off just looking at the Nikola site directly. At the moment, you need to add the last line here to each of POSTS and PAGES in /blog/conf.py, so that Nikola will recognize Jupyter notebooks as ... you know ... both posts and pages. You also need to add the trailing line in COMPILERS.

POSTS = (
    ("posts/*.rst", "posts", "post.tmpl"),
    ("posts/*.txt", "posts", "post.tmpl"),
    ("posts/*.md", "posts", "post.tmpl"),
    ("posts/*.html", "posts", "post.tmpl"),
    ("posts/*.ipynb", "posts", "post.tmpl"),
)

PAGES = (
    ("pages/*.rst", "pages", "story.tmpl"),
    ("pages/*.txt", "pages", "story.tmpl"),
    ("pages/*.md", "pages", "story.tmpl"),
    ("pages/*.html", "pages", "story.tmpl"),
    ("pages/*.ipynb", "pages", "story.tmpl"),
    )

COMPILERS = {
    "rest": ('.txt', '.rst'),
    "markdown": ('.md', '.mdown', '.markdown'),
    "html": ('.html', '.htm'),
    "ipynb": ('.ipynb',),
}

Making posts

Now comes the whole point: making new posts. Blog posts come with meta data (when it was created, etc.) that normally gets automatically handled when you create a new post via nikola new_post -e. If you want to start with other formats (e.g. Markdown or Jupyter notebooks), you just stick the relevant file in /posts and import it via nikola new_post -i.

I took the notes for this migration in Markdown, so I just stick that file in /posts/BlogMigration.md run nikola new_post -i posts/BlogMigration ...

Or, at least, that's how it should work. I'm a little disappointed to find that the new post import keeps breaking, saying it can't find metadata like the date. It's an easy workaround for markdown: I just make a new post, then copy over my markdown. Looking at the results, I find that Nikola wants the metadata imbedded in the markdown file. So, I should have started the markdown file with

<!--
.. title: Switching to Nikloa for Jupyter Notebooks and a static site
.. slug: switching-to-nikloa-for-jupyter-notebooks-and-a-static-site
.. date: 2017-10-12 13:50:23 UTC
.. tags: 
.. category: 
.. link: 
.. description: 
.. type: text
-->

Cool. I tested, and I can indeed just stick that at the top of my markdown file (which I'm storing in a folder called drafts) and then import. Almost. Of all things, it sticks that metadata in twice. So, for now, the clear answer is just to make a blank post and then copy in my markdown. Easy enough.

Meanwhile, importing Jupyter notebooks seems to just work!

then nikola build, then fire up a test server to check it out with nikola serve --browser, verify that all looks awesome, and push it to the blog with nikola github_deploy!

Can I be smarter about late policies?

Questions: Is my late policy reasonable? Are there diversity implications for smart late policies?

Robin DeRosa had an interesting tweet about late policies recently, and I posted my late policy in reply. Here’s a slightly expanded version:

In most of my classes, late work happens because students are really busy, not because they’re slackers. That means a late policy with percentage deductions kind of sucks, because my students will also be really busy the next week. Instead, I combine “no late work accepted” with dropping the equivalent of one week’s worth of each assignment time. E.g. in a class that meets three times a week, I throw out three of the daily assignments.

I make sure to frame this in a discussion with the students, where I explain that the policy is an explicit recognition of the fact that they’re busy. If you’re too busy to get the work done on time, JUST SKIP it, and get your life caught up.

So far, it has been working out really well. The students appreciate the extra lever for managing their schedules, and it’s clear from the beginning that there won’t need to be any exceptions. Note: every semester so far, students have managed to get confused early on … luckily, this comes up in terms of one of those low-weight daily assignments, so we clear it up before a high-stakes situation shows up).

I know this won’t work well for everyone, especially people whose course materials are quite different from mine. I’ve been able to use it well in both intro and upper-level classes, though. It has been easy to adapt to group work as well: groups always get to evaluate each other as part of the process, so they’re in charge of some of the enforcement.

I know other people who have similar policies, but with more bookkeeping. For example, people hand out “extensions” or “free passes” at the beginning of the term. It rewards organization, etc. I don’t think that serves my policy as well. For one thing, you have to have a good bookkeeping system; if you hand out literal extension passes, you have to either make sure students aren’t sharing them with each other, or be OK with that economy. For another thing, I like the raw simplicity of my policy, and I think that it might help in terms of normalizing things between students. I could be wrong!

After that twitter thread, I have some questions:

Are there diversity implications for smart late policies? One of the obvious things I know about supporting diverse groups of students is that you need to recognize the fact that there’s a whole world out there affecting in-class and out-of class performance. One of my hopes with my policy is that it’s explicitly good in this regard, but I’d love feedback here.

Should I add tiered deadlines? Matthew Cheney has a nice policy, where he distinguishes between “hard” and “soft” deadlines, explicitly telling the students which is which, and making that part of the late policy for grading. That seems honest and good, but I like the raw simplicity of my approach, so I’m nervous about adding more to it.

How does this affect colleagues? My department is young, so maybe this is a particular issue for us, but some of us tend to get more pushback from students than others (instead of young, you could probably insert all sorts of gender dynamics, etc. here). In several of our classes, we standardized on my above policy as the departmental policy. Being able to refer to it as The Departmental Policy seems to help those colleagues out.

Comments from Old Blog

3 Responses to Can I be smarter about late policies?

  1. Kat Bartlow says:

    Interesting take. In my classes, I usually have one set of low-stakes assignments (homework, quizzes on Moodle, etc.) My policy on this is to not accept late work, but to only require students to turn in 10 of 15 possible assignments. On the other hand, for major assignments like papers, I do deduct percentage points per day of lateness unless the student makes a case for an extension before the paper deadline. I think there's value in recognizing student busy-ness and not driving them to extremes (particularly on your pressure-cooker campus), but there's also value in reminding them that in order to achieve your goals, you have to set priorities and even (gasp!) work to deadlines.

Optimizing MSD calculations

[twitter-follow username="mglerner" scheme="dark"]

This whole post is available as an IPython Notebook here.

Writing a faster mean-squared-displacement function

I'm trying to calculate the Mean Squared Displacement (MSD) of a particle trajectory. In reality, I have an array of positions for numtrj particles and numpts timepoints and dim dimensions: pos[numtrj, numpts, dim]. I think my question has the same answer if I just have the x trajectory of a single particle, though.

In case you haven't done MSD calculations before, there's one cute way in which you get extra information out of a trajectory. Say I have just five time points, and my positions are

In [83]: x

Out[83]: array([ 0.        ,  1.74528704,  1.59639865,  2.59976219,  3.70852457])

You could just get squared displacement by looking at x**2. However, you could also say that you have 4 different values for the displacement at one timestep:

x[1] - x[0], x[2] - x[1], x[3] - x[2], x[4] - x[3]

Similarly, three values for displacement at two timesteps: x[2:] - x[:-2]. Etc. So, the way I'm calculating MSD at the moment is:

def msd_1d(x):
    result = np.zeros_like(x)
    for i in range(1,len(x)):
        result[i] = np.average((x[i:] - x[:-i])**2)
    return result

or

def get_msd_traj(pos):
    result = np.zeros_like(pos)
    for i in range(1,pos.shape[1]):
        result[:,i,:] = np.average((pos[:,i:,:] - pos[:,:-i,:])**2,axis=1)
    return result

(side note: often the data comes indexed like pos[numpts, numtrj, dim] for molecular dynamics trajectories, but that doesn't change anything here)

So, I asked Joshua Adelman if he had any quick thoughts.

Josh cleared up some basic misconceptions for me and gave a couple of suggestions:

From Josh

Hi Michael,

In general, numba is not going to speed up calculations that involve numpy built-in methods (unless they are things like np.exp, np.sin, np.cos) or array slicing. It works best if you have unrolled any vectorization into explicited nested loops and operate on arrays element-by-element. If you go that route, check to see if you can get the code to compile using @jit(nopython=True), which would be indicitive of numba being able to translate the code to llvm IR without using python objects (which tend to slow things down). I've also had a lot of success lately parallelizing numba code using threading if you aren't dealing with any python objects. If you can compile in nopython mode, then you can also specify:

@jit(nopython=True, nogil=True)

and use a ThreadPool to chop up the work, using something like:

http://numba.pydata.org/numba-doc/0.19.1/user/examples.html?highlight=nogil#multi-threading

it's undocumented, but for simple stuff you can use:

from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=nthreads)
results = pool.map(func, arg)

It has the same API as the multiprocessing pool, and for embarassingly parallel tasks that don't have possible race conditions, it works quite nicely depending on the overhead involved in spawning off the task relative to the cost of the task.

Also, make sure you're using the latest version of numba since it's getting better rapidly. I'm not sure if you'll be able to call np.zeros_like in a numba-ized method in nopython mode, although you might. If not, you can pass in the results array as an argument.

Hope that helps. I can't think of any sneaky vectorization tricks in pure-numpy off the top of my head to make your calculation faster. Let me know if you have any other questions.

Josh

Back to me

So, I tried several things. You have to do some fiddling around to get nopython=True to work. In addition to not knowing about np.average, things like x[1:] - x[:-1] are doomed to failure. However, that last part isn't really trouble, because one of Josh's points was that I should write out all of my loops explicitly. Here are two versions of the 1D version:

In [1]:
import numpy as np
from numba import jit
In [2]:
def msd_1d(x):
    result = np.zeros_like(x)
    for i in range(1,len(x)):
        result[i] = np.average((x[i:] - x[:-i])**2)
    return result

@jit(nopython = True)
def msd_1d_nb1(x):
    result = np.zeros_like(x)
    for delta in range(1,len(x)):
        thisresult = 0
        for i in range(delta,len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result
    
@jit(nopython = True)
def msd_1d_nb2(x):
    result = np.zeros_like(x)
    for delta in range(1,len(x)):
        for i in range(delta,len(x)):
            result[delta] += (x[i] - x[i-delta])**2
        result[delta] = result[delta] / (len(x) - delta)
    return result

Note that the above definitions will work almost no matter what you do. You don't get the numba issues until you try to actually run the functions. Here are some timings:

In [3]:
%timeit msd_1d(np.random.randn(10))
%timeit msd_1d_nb1(np.random.randn(10))
%timeit msd_1d_nb2(np.random.randn(10))
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 114 µs per loop
The slowest run took 11972.39 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 7.02 µs per loop
The slowest run took 14872.73 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 6.94 µs per loop
In [4]:
%timeit msd_1d(np.random.randn(100))
%timeit msd_1d_nb1(np.random.randn(100))
%timeit msd_1d_nb2(np.random.randn(100))
1000 loops, best of 3: 1.19 ms per loop
100000 loops, best of 3: 15.8 µs per loop
100000 loops, best of 3: 18.5 µs per loop
In [5]:
%timeit msd_1d(np.random.randn(1000))
%timeit msd_1d_nb1(np.random.randn(1000))
%timeit msd_1d_nb2(np.random.randn(1000))
100 loops, best of 3: 13.3 ms per loop
1000 loops, best of 3: 540 µs per loop
1000 loops, best of 3: 594 µs per loop

So, pretty big win there! Approximately two orders of magnitude. It looks like it scales well too.

The larger version

So, it looks like there's not much of a difference between version 1 and 2. But it does look like version 1 is a bit faster.

In [6]:
def get_msd_traj(pos):
    result = np.zeros_like(pos)
    for i in range(1,pos.shape[1]):
        result[:,i,:] = np.average((pos[:,i:,:] - pos[:,:-i,:])**2,axis=1)
    return result

@jit(nopython = True)
def get_msd_traj_nb1(pos):
    result = np.zeros_like(pos)
    deltastop = pos.shape[1]
    for traj in range(pos.shape[0]):
        for dim in range(pos.shape[2]):
            for delta in range(1,deltastop):
                thisresult = 0
                for i in range(delta,deltastop):
                    thisresult += (pos[traj,i,dim] - pos[traj,i-delta,dim])**2
                result[traj,delta,dim] = thisresult / (deltastop - delta)
    return result

First, let's do some due diligence and make sure the results are equivalent. Then, timings.

In [7]:
a = np.random.randn(5,3,2)
np.all(get_msd_traj(a) == get_msd_traj_nb1(a))
Out[7]:
True
In [8]:
%timeit get_msd_traj(np.random.randn(10,10,2))
%timeit get_msd_traj_nb1(np.random.randn(10,10,2))
1000 loops, best of 3: 200 µs per loop
100000 loops, best of 3: 14.6 µs per loop
In [9]:
%timeit get_msd_traj(np.random.randn(100,100,2))
%timeit get_msd_traj_nb1(np.random.randn(100,100,2))
10 loops, best of 3: 12.3 ms per loop
1000 loops, best of 3: 1.84 ms per loop
In [10]:
%timeit get_msd_traj(np.random.randn(512,2001,2))
%timeit get_msd_traj_nb1(np.random.randn(512,2001,2))
1 loops, best of 3: 27.8 s per loop
1 loops, best of 3: 2.47 s per loop

So, for the actual data sizes I care about, I'm probably down to "only" an order of magnitude speedup. That's still pretty awesome.

Comments from Old blog

One Response to Using numba to speed up mean-squared displacement calculations

  1. mglerner says:

    On twitter, Konrad Hinsen pointed out that I could get a bigger speedup (likely a factor of 1000, but it's NlogN instead of N^2, so it only gets better) by using a better algorithm. Several years ago, I did try to use an FFT-based method. I was sketched out by the fact that the results I obtained were similar to the exact results, but not identical. It sounds likely that his method is just better. The paper claims it's exact, and a followup conversation indicates that they get better accuracy by doing fewer calculations (thus less accumulated error). Filed away for next time!

MMMC2015

[twitter-follow username="mglerner" scheme="dark"]

This whole post is available as an IPython Notebook here.

MMMC the 2015 update

See the original blog post for details and history. Here's the short story: in my Statistical and Thermal Physics class, we want to use Monte Carlo simulations to generate brackets for March Madness. There are at least two obvious ways to go about this:

  1. Make some function that tells us the chance that team A beats team B, then flip coins for each matchup. That gets you one bracket. Repeat 100,000 times, collect statistics. This is the way Nate Silver's 538.com handles simulations for basketball, elections, etc, and I should probably implement it (note to self/motivated students: it's as easy as just generating 100,000 new brackets at a given temperature).

  2. Generate one bracket, then do a Monte-Carlo walk through bracket space. This is tougher. We have to figure out how to make a move in bracket space, which is part of the fun of Monte Carlo simulations in general. To see how this is done, check out the code in Bracket.swap and Brackets.simulate.

As you can tell, we take option 2 above. I've made things a bit nicer from a user standpoint this year; here's a walkthrough. First, load up our standard IPython setup

In [1]:
from __future__ import division
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import HTML

The basics

make me a single bracket at a given temperature

In [2]:
import MarchMadnessMonteCarlo as MMMC
teams = MMMC.teams['south']
b = MMMC.Bracket(teams=teams,T=0.5)
print b
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            San (8)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Ste (12)                           
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Ste (12) Ste (12)         
SMU (6)                   UCL (11) UCL (11) UCL (11) Ste (12)
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Iow (7)  Iow (7)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -17.4533174222

Now, instead of asking for all of the teams in the South, make up a set of Final Four teams, and run 1000 simulations. Show some statistics

In [3]:
sr = MMMC.simulate(1000,['Kentucky','Wisconsin','Villanova','Duke'],0.5)
MMMC.showstats(sr,newfig=True)
Lowest energy bracket
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Villanova (1)             Vil (1)  Ken (1) 
Duke (1)                                   
Total bracket energy: -3.05919025906

Most common bracket (159)
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Villanova (1)             Duk (1)  Ken (1) 
Duke (1)                                   
Total bracket energy: -3.04122415391

As you can see, we fully sample bracket space pretty quickly (look at the graph in the top right, with the unique brackets shown).

What should our temperature be?

If we had chosen option 1 at the top, we'd just flip coins with a given probability of winning. Here (and this may be a questionable decision), we need to set an overall temperature for our simulation. Intuitively, the higher the temperature, the closer we come to a random outcome. The lower the temperature, the closer we come to a "best seed always wins" bracket. If we're going to make sense of temperature, we should pick a reasonable energy function.

We can use KenPom's log5 defined as

def log5_energy_game(winner, loser):
    A,B = strength[winner],strength[loser]
    # see http://207.56.97.150/articles/playoff2002.htm
    win_pct = (A-A*B)/(A+B-2*A*B)
    return -win_pct

Conveniently, that's coded up for you in MarchMadnessMonteCarlo.examples.

Later on, we could make a fancy energy function with, e.g., a weighted average of KenPom, Jeff Sagarin, and the NCAA rankings.

In [4]:
import MarchMadnessMonteCarlo.examples
MMMC.set_energy_function(MarchMadnessMonteCarlo.examples.log5_energy_game)

Now, what should our actual temperature be? Historically, we know that an 8 seed vs. a 9 seed should essentially be a tossup. So, as a proxy here, we could just look at the chance of an 8 seed winning over a range of temperatures, and pick the point where it's pretty close to 0.5.

In [5]:
def winpct8(team8,team9,T,numtrials=10000):
    results = [MMMC.playgame(team8,team9,T)[0] == team8 for i in range(numtrials)]
    return np.average(results)
def plotwins(team8,team9,numtrials=10000):
    Ts = np.linspace(0,3,100)
    pct = [winpct8(team8,team9,T,numtrials) for T in Ts]
    plt.plot(Ts,pct,label='{t1} vs. {t2}'.format(t1=team8,t2=team9))
    plt.xlabel('T')
    plt.ylabel('winpct')

We'll look at it for all four of the 8 vs. 9 matchups

In [6]:
plotwins('Cincinnati','Purdue')
plotwins('Oregon','Oklahoma St.')
plotwins('North Carolina St.','LSU')
plotwins('San Diego St.',"St. John's")
plt.legend()
Out[6]:
<matplotlib.legend.Legend at 0x7fc7e94c7650>

A couple of things jump out. First, the green line: KenPom has Oregon ranked below Oklahoma State, so the curve approaches 50% from below instead of from above.

Second, just eyeballing that, it looks like we could make a legitimate argument for almost anything between T=1.0 and 2.0.

What does that do to "clear" matchups? Well, Kentucky is a big favorite over Kansas with this model:

In [7]:
MarchMadnessMonteCarlo.examples.log5_energy_game('Kentucky','Kansas')
Out[7]:
-0.8176307825952608
In [8]:
plotwins('Kentucky','Kansas')

If I pick T=1.5, I get Kentucky, the favorite, winning 60% of the time. You can feel free to choose a "saner" temperature, but I'm rooting for KU here!

Running some brackets!

So what does a bracket happen to look like? Well, in the original blog post, I mentioned that we came up with two different ways to run brackets. It can take a while to run a bracket, so a fast way (implemented as runbracket2) is to run separate brackets for each of the individual regions, then take the winner from each region's most common bracket to form a Final Four. Below, we do just that, running 10,000 trials for each of the regions, and 1000 trials for the Final Four.

In [9]:
results = MMMC.runbracket2(ntrials1=10000,ntrials2=1000,T=1.5)
YOUR LOWEST ENERGY BRACKETS
LOWEST ENERGY BRACKET FOR REGION midwest
Kentucky (1)                                                 
Hampton (16)              Ken (1)                            
Cincinnati (8)                                               
Purdue (9)                Cin (8)  Ken (1)                   
West Virginia (5)                                            
Buffalo (12)              Wes (5)                            
Maryland (4)                                                 
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)          
Butler (6)                Tex (11) Not (3)  Not (3)  Ken (1) 
Texas (11)                                                   
Notre Dame (3)            Not (3)                            
Northeastern (14)                                            
Wichita St. (7)           Wic (7)  Kan (2)                   
Indiana (10)                                                 
Kansas (2)                Kan (2)                            
New Mexico St. (15)                                          
Total bracket energy: -10.5524230049


LOWEST ENERGY BRACKET FOR REGION west
Wisconsin (1)                                                
Coastal Carolina (16)     Wis (1)                            
Oregon (8)                                                   
Oklahoma St. (9)          Okl (9)  Wis (1)                   
Arkansas (5)                                                 
Wofford (12)              Ark (5)                            
North Carolina (4)                                           
Harvard (13)              Nor (4)  Nor (4)  Wis (1)          
Xavier (6)                Xav (6)  Bay (3)  Ari (2)  Ari (2) 
Mississippi (11)                                             
Baylor (3)                Bay (3)                            
Georgia St. (14)                                             
VCU (7)                   Ohi (10) Ari (2)                   
Ohio St. (10)                                                
Arizona (2)               Ari (2)                            
Texas Southern (15)                                          
Total bracket energy: -10.8592988759


LOWEST ENERGY BRACKET FOR REGION south
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            San (8)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Uta (5)                            
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)          
SMU (6)                   SMU (6)  Iow (3)  Gon (2)  Gon (2) 
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Iow (7)  Gon (2)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -10.5015086048


LOWEST ENERGY BRACKET FOR REGION east
Villanova (1)                                                
Lafayette (16)            Vil (1)                            
North Carolina St. (8)                                       
LSU (9)                   Nor (8)  Vil (1)                   
Northern Iowa (5)                                            
Wyoming (12)              Nor (5)                            
Louisville (4)                                               
UC Irvine (13)            Lou (4)  Nor (5)  Vil (1)          
Providence (6)            Pro (6)  Okl (3)  Vir (2)  Vir (2) 
Dayton (11)                                                  
Oklahoma (3)              Okl (3)                            
Albany (14)                                                  
Michigan St. (7)          Mic (7)  Vir (2)                   
Georgia (10)                                                 
Virginia (2)              Vir (2)                            
Belmont (15)                                                 
Total bracket energy: -10.8839299775


LOWEST ENERGY BRACKET FOR FINAL FOUR
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Utah (5)                  Vir (2)  Ken (1) 
Virginia (2)                               
Total bracket energy: -1.95692907302

YOUR MOST COMMON BRACKETS
MOST COMMON BRACKET FOR REGION midwest
Kentucky (1)                                                 
Hampton (16)              Ken (1)                            
Cincinnati (8)                                               
Purdue (9)                Pur (9)  Ken (1)                   
West Virginia (5)                                            
Buffalo (12)              Wes (5)                            
Maryland (4)                                                 
Valparaiso (13)           Mar (4)  Mar (4)  Ken (1)          
Butler (6)                Tex (11) Not (3)  Kan (2)  Ken (1) 
Texas (11)                                                   
Notre Dame (3)            Not (3)                            
Northeastern (14)                                            
Wichita St. (7)           Wic (7)  Kan (2)                   
Indiana (10)                                                 
Kansas (2)                Kan (2)                            
New Mexico St. (15)                                          
Total bracket energy: -10.3577160931

number of times this bracket happened: 11


MOST COMMON BRACKET FOR REGION west
Wisconsin (1)                                                
Coastal Carolina (16)     Wis (1)                            
Oregon (8)                                                   
Oklahoma St. (9)          Ore (8)  Wis (1)                   
Arkansas (5)                                                 
Wofford (12)              Ark (5)                            
North Carolina (4)                                           
Harvard (13)              Nor (4)  Nor (4)  Wis (1)          
Xavier (6)                Mis (11) Mis (11) Ohi (10) Wis (1) 
Mississippi (11)                                             
Baylor (3)                Geo (14)                           
Georgia St. (14)                                             
VCU (7)                   Ohi (10) Ohi (10)                  
Ohio St. (10)                                                
Arizona (2)               Tex (15)                           
Texas Southern (15)                                          
Total bracket energy: -9.30614882152

number of times this bracket happened: 13


MOST COMMON BRACKET FOR REGION south
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            St. (9)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Uta (5)                            
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Uta (5)  Uta (5)          
SMU (6)                   UCL (11) UCL (11) Gon (2)  Uta (5) 
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Dav (10) Gon (2)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -9.78926932204

number of times this bracket happened: 9


MOST COMMON BRACKET FOR REGION east
Villanova (1)                                                
Lafayette (16)            Vil (1)                            
North Carolina St. (8)                                       
LSU (9)                   LSU (9)  Vil (1)                   
Northern Iowa (5)                                            
Wyoming (12)              Nor (5)                            
Louisville (4)                                               
UC Irvine (13)            Lou (4)  Lou (4)  Vil (1)          
Providence (6)            Day (11) Okl (3)  Vir (2)  Vir (2) 
Dayton (11)                                                  
Oklahoma (3)              Okl (3)                            
Albany (14)                                                  
Michigan St. (7)          Geo (10) Vir (2)                   
Georgia (10)                                                 
Virginia (2)              Vir (2)                            
Belmont (15)                                                 
Total bracket energy: -10.5204279257

number of times this bracket happened: 11


MOST COMMON BRACKET FOR FINAL FOUR
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Utah (5)                  Uta (5)  Ken (1) 
Virginia (2)                               
Total bracket energy: -1.78538460124

number of times this bracket happened: 176

So, not looking so hot for my Kansas, but the code is working.

Let's visualize the results, then make a table of results, as per 538.

In [10]:
# Basic visualization
for region in 'midwest west east south'.split():
    MMMC.showstats(results[region],description=region,newfig=True)

Here you can see the classic reason to do Monte Carlo samping in the first place: we're absolutely nowhere near fully sampling bracket space. You can tell this from each of the "Unique brackets over the trajectory" plots, none of which have leveled out.

Further, you can see that the strength distribution in each of the regions is pretty different. Check out the energy distribution histograms for visual proof.

Nice tables

And here's our set of tabulated results. Note that things look a little funny for the Final Four: runbracket2 runs the final four separate from the rest of the tourney. So, only four teams are allowed to have Championship and Win percentages above zero (the final four percentages are taken from the first chunk of the tourney).

In [11]:
h = HTML(MMMC.maketable(results))
h
Out[11]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 65.74 49.18 36.16 25.29 55.3 34.5
Wisconsin west 1 100.0 63.38 44.28 30.04 19.44 44.7 24.2
Virginia east 2 100.0 64.6 41.15 27.68 17.86 51.3 22.4
Utah south 5 100.0 58.29 35.86 20.96 12.19 48.7 18.9
Arizona west 2 100.0 62.35 42.23 28.64 18.74 0.0 0.0
Villanova east 1 100.0 65.73 44.05 29.84 18.61 0.0 0.0
Duke south 1 100.0 66.03 41.64 25.2 15.05 0.0 0.0
Gonzaga south 2 100.0 61.67 37.12 23.51 14.22 0.0 0.0
Kansas midwest 2 100.0 58.6 33.71 19.16 10.31 0.0 0.0
Oklahoma east 3 100.0 60.74 34.96 18.34 10.01 0.0 0.0
Northern Iowa east 5 100.0 65.23 37.58 19.72 9.97 0.0 0.0
Notre Dame midwest 3 100.0 57.37 32.47 18.57 9.65 0.0 0.0
Iowa St. south 3 100.0 57.01 33.22 17.32 9.07 0.0 0.0
Baylor west 3 100.0 56.3 33.23 16.77 8.79 0.0 0.0
North Carolina west 4 100.0 61.15 35.63 17.73 8.7 0.0 0.0
SMU south 6 100.0 53.2 28.16 14.61 7.62 0.0 0.0
Wichita St. midwest 7 100.0 51.59 28.31 15.41 7.54 0.0 0.0
Georgetown south 4 100.0 60.99 28.74 15.08 7.37 0.0 0.0
Michigan St. east 7 100.0 55.17 25.73 14.43 7.17 0.0 0.0
Iowa south 7 100.0 51.15 25.99 13.76 6.63 0.0 0.0
Louisville east 4 100.0 56.62 29.3 14.05 6.58 0.0 0.0
West Virginia midwest 5 100.0 55.77 31.54 13.3 6.47 0.0 0.0
Providence east 6 100.0 53.03 27.3 12.58 6.19 0.0 0.0
Butler midwest 6 100.0 53.61 27.03 13.43 6.05 0.0 0.0
Texas midwest 11 100.0 46.39 23.9 12.46 5.99 0.0 0.0
Xavier west 6 100.0 50.71 25.68 12.46 5.95 0.0 0.0
Ohio St. west 10 100.0 51.62 22.26 11.19 5.29 0.0 0.0
Arkansas west 5 100.0 57.67 28.22 11.83 5.14 0.0 0.0
Maryland midwest 4 100.0 55.25 25.84 11.73 5.14 0.0 0.0
VCU west 7 100.0 48.38 21.78 11.24 5.06 0.0 0.0
Oklahoma St. west 9 100.0 50.19 21.58 11.31 4.84 0.0 0.0
San Diego St. south 8 100.0 47.92 23.04 10.71 4.81 0.0 0.0
Oregon west 8 100.0 49.81 20.61 10.42 4.73 0.0 0.0
Cincinnati midwest 8 100.0 49.74 19.41 9.93 4.72 0.0 0.0
Davidson south 10 100.0 48.85 23.46 10.7 4.54 0.0 0.0
LSU east 9 100.0 53.31 24.25 11.58 4.53 0.0 0.0
North Carolina St. east 8 100.0 46.69 21.93 10.18 4.35 0.0 0.0
Stephen F. Austin south 12 100.0 41.71 21.31 9.08 4.3 0.0 0.0
UCLA south 11 100.0 46.8 21.71 9.51 4.22 0.0 0.0
St. John's south 9 100.0 52.08 22.83 9.84 4.17 0.0 0.0
Purdue midwest 9 100.0 50.26 19.41 8.86 3.76 0.0 0.0
Buffalo midwest 12 100.0 44.23 23.13 8.76 3.55 0.0 0.0
Georgia east 10 100.0 44.83 19.41 8.76 3.53 0.0 0.0
Dayton east 11 100.0 46.97 22.59 8.55 3.52 0.0 0.0
Mississippi west 11 100.0 49.29 21.45 9.01 3.51 0.0 0.0
Indiana midwest 10 100.0 48.41 22.29 8.57 3.14 0.0 0.0
Valparaiso midwest 13 100.0 44.75 19.49 7.86 3.1 0.0 0.0
Georgia St. west 14 100.0 43.7 19.64 6.96 2.67 0.0 0.0
Harvard west 13 100.0 38.85 18.65 6.98 2.36 0.0 0.0
Wofford west 12 100.0 42.33 17.5 6.84 2.33 0.0 0.0
New Mexico St. midwest 15 100.0 41.4 15.69 6.3 2.26 0.0 0.0
UC Irvine east 13 100.0 43.38 19.57 6.64 2.17 0.0 0.0
Northeastern midwest 14 100.0 42.63 16.6 6.1 2.17 0.0 0.0
UAB south 14 100.0 42.99 16.91 6.09 2.02 0.0 0.0
Wyoming east 12 100.0 34.77 13.55 5.06 1.8 0.0 0.0
Albany east 14 100.0 39.26 15.15 4.86 1.59 0.0 0.0
Eastern Washington south 13 100.0 39.01 14.09 5.06 1.42 0.0 0.0
Coastal Carolina west 16 100.0 36.62 13.53 4.85 1.42 0.0 0.0
North Dakota St. south 15 100.0 38.33 13.43 4.5 1.35 0.0 0.0
Belmont east 15 100.0 35.4 13.71 4.8 1.28 0.0 0.0
Texas Southern west 15 100.0 37.65 13.73 3.73 1.03 0.0 0.0
Robert Morris south 16 100.0 33.97 12.49 4.07 1.02 0.0 0.0
Hampton midwest 16 100.0 34.26 12.0 3.4 0.86 0.0 0.0
Lafayette east 16 100.0 34.27 9.77 2.93 0.84 0.0 0.0

Given those concerns, let's just run runbracket1 and look at the results

In [12]:
results = MMMC.runbracket1(ntrials=10000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Not (3)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Ken (1) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -44.697443432

Most common bracket (7)
Kentucky (1)                                                                   
Hampton (16)              Ham (16)                                             
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Cin (8)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Cin (8)                            
Butler (6)                                                                     
Texas (11)                But (6)                                              
Notre Dame (3)                                                                 
Northeastern (14)         Nor (14) But (6)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  But (6)  But (6)                   
Duke (1)                                                                       
Robert Morris (16)        Rob (16)                                             
San Diego St. (8)                                                              
St. John's (9)            San (8)  San (8)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Geo (4)  Geo (4)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Gon (2)          
Wisconsin (1)             Wis (1)  Ore (8)  Ore (8)  Ari (2)  Ari (2)  Ari (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Ore (8)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Xav (6)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Laf (16) Nor (8)  Nor (5)  Mic (7)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            UC  (13)                                             
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Pro (6)  Mic (7)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Bel (15)                                             
Belmont (15)                                                                   
Total bracket energy: -38.100050778

In [13]:
h = HTML(MMMC.maketable(results))
h
Out[13]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 58.69 44.77 32.58 23.98 16.41 10.87
Villanova east 1 100.0 65.53 45.71 32.0 20.87 13.77 8.67
Arizona west 2 100.0 62.57 44.59 33.54 20.75 13.11 8.03
Wisconsin west 1 100.0 67.67 43.02 27.42 18.12 11.35 7.15
Virginia east 2 100.0 63.76 43.67 30.86 19.05 11.08 6.59
Gonzaga south 2 100.0 57.98 39.95 25.46 15.02 9.09 4.78
Utah south 5 100.0 57.92 34.4 21.04 13.19 6.8 3.66
Duke south 1 100.0 60.1 35.45 18.48 11.41 6.73 3.49
Notre Dame midwest 3 100.0 61.12 33.93 19.4 11.17 5.57 2.8
Oklahoma east 3 100.0 62.0 37.74 17.64 10.01 5.26 2.64
Kansas midwest 2 100.0 66.57 35.81 19.02 9.97 5.45 2.59
Northern Iowa east 5 100.0 62.27 35.11 18.67 10.05 4.82 2.38
Iowa St. south 3 100.0 60.06 36.95 20.04 11.12 5.24 2.36
Wichita St. midwest 7 100.0 56.94 31.91 19.06 8.76 4.18 2.17
North Carolina west 4 100.0 58.35 32.3 18.51 9.54 4.23 1.82
SMU south 6 100.0 55.35 29.26 16.91 8.61 3.79 1.61
Baylor west 3 100.0 56.63 29.51 11.82 7.12 3.63 1.56
Georgetown south 4 100.0 56.75 29.58 15.75 7.23 3.24 1.42
Texas midwest 11 100.0 58.28 27.89 14.8 6.87 3.3 1.39
Michigan St. east 7 100.0 59.67 27.43 14.51 6.41 2.98 1.29
Ohio St. west 10 100.0 50.4 23.8 12.61 6.46 3.19 1.27
West Virginia midwest 5 100.0 58.66 33.41 13.82 6.62 2.97 1.17
Arkansas west 5 100.0 57.66 31.16 12.71 5.34 2.38 1.16
Louisville east 4 100.0 55.49 31.49 15.28 7.18 3.1 1.14
Iowa south 7 100.0 55.68 25.06 9.92 5.49 2.67 1.13
Butler midwest 6 100.0 41.72 24.06 11.13 6.36 2.77 1.12
Xavier west 6 100.0 56.43 31.36 14.27 6.71 2.44 1.12
San Diego St. south 8 100.0 52.15 29.17 12.83 6.64 2.88 1.03
Cincinnati midwest 8 100.0 50.58 20.36 12.28 5.35 2.38 0.9
Davidson south 10 100.0 44.32 15.66 6.98 3.12 1.69 0.73
VCU west 7 100.0 49.6 19.03 8.63 4.28 1.81 0.71
Maryland midwest 4 100.0 50.87 22.22 9.44 3.6 1.86 0.7
Oregon west 8 100.0 58.41 24.89 10.74 4.78 1.74 0.66
Stephen F. Austin south 12 100.0 42.08 23.6 11.54 4.51 1.72 0.63
St. John's south 9 100.0 47.85 18.82 9.51 3.97 1.83 0.62
Dayton east 11 100.0 48.81 21.64 8.96 4.31 1.79 0.58
Mississippi west 11 100.0 43.57 21.86 8.38 3.0 1.41 0.55
Purdue midwest 9 100.0 49.42 20.37 9.7 3.31 1.33 0.53
Buffalo midwest 12 100.0 41.34 21.68 7.93 2.77 1.19 0.52
Valparaiso midwest 13 100.0 49.13 22.69 9.34 4.12 1.49 0.51
Providence east 6 100.0 51.19 26.68 11.18 4.28 1.4 0.48
LSU east 9 100.0 48.81 15.79 7.11 2.58 0.92 0.47
Georgia east 10 100.0 40.33 15.12 7.58 2.95 1.3 0.46
Oklahoma St. west 9 100.0 41.59 18.78 9.24 3.35 1.1 0.45
Wofford west 12 100.0 42.34 16.57 7.37 2.63 0.94 0.44
Indiana midwest 10 100.0 43.06 18.98 8.46 3.72 1.2 0.43
UCLA south 11 100.0 44.65 21.4 8.22 2.72 1.26 0.42
North Carolina St. east 8 100.0 51.19 22.31 8.99 3.55 1.45 0.4
Harvard west 13 100.0 41.65 19.97 8.53 2.49 0.8 0.35
UC Irvine east 13 100.0 44.51 20.52 7.19 2.44 0.97 0.26
Coastal Carolina west 16 100.0 32.33 13.31 5.48 2.24 0.57 0.24
Georgia St. west 14 100.0 43.37 17.27 6.04 1.92 0.47 0.21
UAB south 14 100.0 39.94 12.39 4.94 1.57 0.73 0.17
Eastern Washington south 13 100.0 43.25 12.42 3.88 1.27 0.55 0.17
Lafayette east 16 100.0 34.47 16.19 7.11 2.28 0.5 0.15
Wyoming east 12 100.0 37.73 12.88 3.65 1.12 0.4 0.15
Albany east 14 100.0 38.0 13.94 4.57 1.84 0.53 0.11
Robert Morris south 16 100.0 39.9 16.56 6.97 2.39 0.5 0.11
Belmont east 15 100.0 36.24 13.78 4.7 1.08 0.31 0.11
Texas Southern west 15 100.0 37.43 12.58 4.71 1.27 0.25 0.11
Northeastern midwest 14 100.0 38.88 14.12 4.28 0.99 0.29 0.09
Hampton midwest 16 100.0 41.31 14.5 4.91 1.25 0.25 0.07
North Dakota St. south 15 100.0 42.02 19.33 7.53 1.74 0.33 0.05
New Mexico St. midwest 15 100.0 33.43 13.3 3.85 1.16 0.31 0.05

That is, indeed, high temperature! Let's try it with a bunch more runs

In [14]:
results = MMMC.runbracket1(ntrials=100000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Not (3)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Ken (1) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -44.697443432

Most common bracket (9)
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Val (13) Buf (12) Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Nor (14) Nor (14)                                    
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Wic (7)  Wic (7)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Rob (16)                                             
San Diego St. (8)                                                              
St. John's (9)            St. (9)  St. (9)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Eas (13) Eas (13) St. (9)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Dav (10)                                             
Gonzaga (2)                                                                    
North Dakota St. (15)     Nor (15) Dav (10) Iow (3)  Iow (3)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Ark (5)  Ark (5)  Vir (2)  Vir (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Ark (5)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Bay (3)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ohi (10)                                    
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Lou (4)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    LSU (9)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Day (11) Alb (14) Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Alb (14)                                             
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -37.1483683417

In [15]:
h = HTML(MMMC.maketable(results))
h
Out[15]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 67.848 49.189 36.444 26.79 18.949 12.828
Arizona west 2 100.0 64.114 42.628 29.248 18.838 12.256 7.805
Wisconsin west 1 100.0 63.368 42.146 28.131 17.705 11.083 6.862
Villanova east 1 100.0 64.314 43.814 28.618 17.425 10.64 6.365
Virginia east 2 100.0 62.189 39.522 26.285 17.038 10.321 6.183
Gonzaga south 2 100.0 61.672 38.305 24.17 14.417 8.149 4.538
Duke south 1 100.0 64.407 40.263 24.75 14.927 8.061 4.37
Oklahoma east 3 100.0 62.257 38.005 20.227 11.564 5.906 3.04
Utah south 5 100.0 56.188 34.295 19.469 10.846 5.576 2.832
Notre Dame midwest 3 100.0 60.692 33.504 18.74 9.665 5.061 2.473
Kansas midwest 2 100.0 59.732 33.553 18.682 9.098 4.834 2.329
Northern Iowa east 5 100.0 58.928 34.06 17.509 9.302 4.604 2.239
North Carolina west 4 100.0 61.148 36.852 19.086 9.535 4.682 2.223
Iowa St. south 3 100.0 59.581 33.405 18.131 9.562 4.647 2.206
Baylor west 3 100.0 58.599 34.549 16.969 9.177 4.63 2.202
Wichita St. midwest 7 100.0 56.45 31.005 17.065 8.454 4.378 2.104
Louisville east 4 100.0 56.525 30.188 15.331 6.889 3.158 1.413
Texas midwest 11 100.0 52.783 28.184 14.644 6.604 3.115 1.372
SMU south 6 100.0 54.37 29.068 14.489 6.789 3.097 1.361
Georgetown south 4 100.0 58.912 28.238 14.08 6.915 3.015 1.311
Michigan St. east 7 100.0 55.477 25.467 12.998 6.372 2.951 1.292
Ohio St. west 10 100.0 50.1 23.617 12.158 5.813 2.825 1.18
West Virginia midwest 5 100.0 52.507 28.241 12.365 5.829 2.809 1.09
Butler midwest 6 100.0 47.217 24.427 12.449 5.551 2.47 1.088
Iowa south 7 100.0 51.023 24.916 12.842 6.02 2.641 1.084
Arkansas west 5 100.0 58.71 28.683 13.046 5.712 2.619 1.074
San Diego St. south 8 100.0 52.821 25.262 12.281 5.721 2.585 1.026
Xavier west 6 100.0 52.833 26.252 11.254 5.424 2.318 0.966
VCU west 7 100.0 49.9 22.655 11.034 5.322 2.272 0.915
Davidson south 10 100.0 48.977 23.746 11.654 5.393 2.259 0.887
Maryland midwest 4 100.0 54.293 27.841 10.992 4.69 2.124 0.852
Cincinnati midwest 8 100.0 49.481 21.064 10.878 4.977 2.137 0.809
Providence east 6 100.0 50.52 23.839 10.379 4.86 1.981 0.778
North Carolina St. east 8 100.0 52.32 22.887 11.35 4.862 1.818 0.717
Oregon west 8 100.0 50.892 22.58 10.956 4.322 1.844 0.7
St. John's south 9 100.0 47.179 22.962 10.652 4.691 1.904 0.687
Stephen F. Austin south 12 100.0 43.812 22.666 9.837 4.551 1.859 0.674
Oklahoma St. west 9 100.0 49.108 21.697 10.965 4.484 1.736 0.666
Georgia east 10 100.0 44.523 21.647 10.237 4.523 1.723 0.65
Dayton east 11 100.0 49.48 22.922 9.937 4.078 1.609 0.594
LSU east 9 100.0 47.68 20.72 9.64 3.961 1.595 0.584
UCLA south 11 100.0 45.63 22.264 9.285 3.921 1.548 0.559
Mississippi west 11 100.0 47.167 20.522 8.975 3.684 1.45 0.53
Buffalo midwest 12 100.0 47.493 23.217 8.816 3.647 1.519 0.525
Purdue midwest 9 100.0 50.519 19.548 9.389 3.926 1.513 0.503
Indiana midwest 10 100.0 43.55 19.487 8.32 3.427 1.3 0.446
Valparaiso midwest 13 100.0 45.707 20.701 8.106 3.225 1.236 0.43
Georgia St. west 14 100.0 41.401 18.677 6.87 2.788 0.94 0.321
Harvard west 13 100.0 38.852 17.912 7.032 2.36 0.872 0.269
UC Irvine east 13 100.0 43.475 18.635 7.212 2.373 0.699 0.232
Wofford west 12 100.0 41.29 16.553 5.962 2.169 0.686 0.214
Wyoming east 12 100.0 41.072 17.117 6.21 2.154 0.697 0.194
New Mexico St. midwest 15 100.0 40.268 15.955 5.428 1.908 0.604 0.167
UAB south 14 100.0 40.419 15.263 4.757 1.649 0.545 0.153
Northeastern midwest 14 100.0 39.308 13.885 4.672 1.41 0.479 0.145
Belmont east 15 100.0 37.811 13.364 4.884 1.732 0.534 0.144
Albany east 14 100.0 37.743 15.234 5.053 1.602 0.465 0.14
Eastern Washington south 13 100.0 41.088 14.801 5.269 1.861 0.544 0.139
Coastal Carolina west 16 100.0 36.632 13.577 4.822 1.707 0.492 0.126
North Dakota St. south 15 100.0 38.328 13.033 4.672 1.602 0.478 0.121
Lafayette east 16 100.0 35.686 12.579 4.13 1.265 0.31 0.078
Robert Morris south 16 100.0 35.593 11.513 3.662 1.135 0.33 0.073
Texas Southern west 15 100.0 35.886 11.1 3.492 0.96 0.284 0.066
Hampton midwest 16 100.0 32.152 10.199 3.01 0.799 0.234 0.056
In [17]:
MMMC.showstats(results['all'],description='full run',newfig=True)

Not bad! (Also, note the shape of the energy distribution, and the fact taht the "Unique brackets" group is still basically linear!)

Who wins?

But let's be honest for a minute. We all know Kansas is going to win. What do the brackets where that happened look like?

In [18]:
goodbrackets = [i for i in results['all'].brackets if i.bracket[-1][0] == 'Kansas']
In [19]:
from MarchMadnessMonteCarlo.Brackets import Stats
lb, mcb, mcb_count, unique_brackets, lowest_sightings = Stats.gather_uniquestats(goodbrackets)
sr = MMMC.SimulationResults(goodbrackets,unique_brackets,lb,lowest_sightings,mcb,mcb_count)
trueresults = {'all':sr}
In [20]:
h = HTML(MMMC.maketable(trueresults))
h
Out[20]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kansas midwest 2 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Virginia east 2 100.0 61.743237441 39.7595534564 25.7621296694 17.1318162301 9.01674538429 0.0
Villanova east 1 100.0 68.05495921 47.8316874195 30.8286818377 17.3894375268 8.71618720481 0.0
Wisconsin west 1 100.0 62.0008587377 39.5019321597 24.9033920137 14.2121082009 8.2438814942 0.0
Oklahoma east 3 100.0 63.8042078145 42.7221983684 22.4130528124 13.0957492486 7.7715757836 0.0
Arizona west 2 100.0 62.9025332761 37.5697724345 24.4310863031 14.6414770288 7.72863890082 0.0
Northern Iowa east 5 100.0 59.5534564191 37.1404036067 18.3769858308 10.5195362817 5.45298411335 0.0
Baylor west 3 100.0 55.3456419064 30.9145556033 15.5431515672 9.05968226707 4.89480463718 0.0
North Carolina west 4 100.0 58.8664662945 35.4229282954 19.321597252 9.61786174324 4.8518677544 0.0
Michigan St. east 7 100.0 57.4924860455 26.4920566767 12.7951910691 6.26878488622 3.47788750537 0.0
Louisville east 4 100.0 58.3082868184 29.2400171748 14.1262344354 6.82696436239 3.30613997424 0.0
Xavier west 6 100.0 50.1502790897 25.1610133104 11.9793902963 7.29927007299 3.17732932589 0.0
Oregon west 8 100.0 52.2112494633 22.4559896951 12.2799484757 5.28123658222 2.9197080292 0.0
Ohio St. west 10 100.0 50.493774152 26.8355517389 13.6539287248 5.53885787892 2.79089738085 0.0
Arkansas west 5 100.0 60.3263203091 28.9823958781 11.8935165307 5.88235294118 2.70502361529 0.0
VCU west 7 100.0 49.506225848 23.5723486475 12.1082009446 5.83941605839 2.53327608416 0.0
Providence east 6 100.0 51.1807642765 21.4255045084 9.44611421211 4.46543580936 2.44740231859 0.0
North Carolina St. east 8 100.0 55.3885787892 24.6887075998 13.009875483 4.8518677544 2.14684413912 0.0
Mississippi west 11 100.0 49.8497209103 24.6887075998 10.9489051095 4.76599398884 1.9321597252 0.0
Dayton east 11 100.0 48.8192357235 20.0085873766 9.14555603263 4.03606698154 1.9321597252 0.0
Oklahoma St. west 9 100.0 47.7887505367 22.0695577501 11.5500214684 4.20781451267 1.71747531129 0.0
Georgia St. west 14 100.0 44.6543580936 19.2357234865 7.64276513525 3.82138256763 1.71747531129 0.0
Georgia east 10 100.0 42.5075139545 20.6097037355 9.40317732933 3.60669815371 1.63160154573 0.0
Harvard west 13 100.0 41.1335337055 20.0515242593 8.41562902533 3.34907685702 1.58866466295 0.0
Wofford west 12 100.0 39.6736796909 15.5431515672 5.88235294118 2.9197080292 1.37398024903 0.0
LSU east 9 100.0 44.6114212108 18.3769858308 8.75912408759 3.69257191928 1.33104336625 0.0
Belmont east 15 100.0 38.256762559 13.1386861314 6.65521683126 2.10390725633 1.07342206956 0.0
Coastal Carolina west 16 100.0 37.9991412623 15.972520395 5.75354229283 2.36152855303 0.901674538429 0.0
Wyoming east 12 100.0 40.4465435809 17.1747531129 5.5817947617 1.71747531129 0.686990124517 0.0
UC Irvine east 13 100.0 41.6917131816 16.4448261056 5.66766852726 1.63160154573 0.515242593388 0.0
Texas Southern west 15 100.0 37.0974667239 12.022327179 3.69257191928 1.2022327179 0.515242593388 0.0
Lafayette east 16 100.0 31.94504079 9.10261914985 3.6496350365 1.15929583512 0.472305710605 0.0
Albany east 14 100.0 36.1957921855 15.8437097467 4.3795620438 1.50279089738 0.429368827823 0.0
Duke south 1 100.0 61.8291112065 40.1459854015 22.1124946329 13.5680549592 0.0 0.0
Gonzaga south 2 100.0 55.5173894375 35.0364963504 21.4684413912 10.0472305711 0.0 0.0
Utah south 5 100.0 54.8733361958 30.184628596 15.8866466295 9.7037355088 0.0 0.0
Iowa St. south 3 100.0 59.167024474 32.546157149 17.9905538858 9.36024044654 0.0 0.0
Georgetown south 4 100.0 58.780592529 30.184628596 16.1013310434 8.54443967368 0.0 0.0
St. John's south 9 100.0 49.0339201374 23.7011592958 13.7827393731 6.86990124517 0.0 0.0
Stephen F. Austin south 12 100.0 45.1266638042 24.9033920137 10.261914985 6.39759553456 0.0 0.0
SMU south 6 100.0 54.6157148991 27.6513525118 13.8686131387 6.35465865178 0.0 0.0
Davidson south 10 100.0 48.6045513096 24.3022756548 12.022327179 5.5817947617 0.0 0.0
San Diego St. south 8 100.0 50.9660798626 22.6706741091 12.2799484757 5.49592099614 0.0 0.0
Iowa south 7 100.0 51.3954486904 24.4740231859 12.5375697724 5.45298411335 0.0 0.0
UCLA south 11 100.0 45.3842851009 23.1000429369 11.1635895234 5.06655216831 0.0 0.0
Eastern Washington south 13 100.0 41.219407471 14.7273507943 5.88235294118 2.49033920137 0.0 0.0
North Dakota St. south 15 100.0 44.4826105625 16.1872048089 6.18291112065 2.06097037355 0.0 0.0
UAB south 14 100.0 40.832975526 16.7024474023 4.76599398884 1.84628595964 0.0 0.0
Robert Morris south 16 100.0 38.1708887935 13.4821811936 3.69257191928 1.15929583512 0.0 0.0
Kentucky midwest 1 100.0 61.743237441 39.6307428081 21.8119364534 0.0 0.0 0.0
Maryland midwest 4 100.0 59.4246457707 31.2151137827 16.3160154573 0.0 0.0 0.0
Purdue midwest 9 100.0 55.7320738514 27.4796049807 13.3963074281 0.0 0.0 0.0
Buffalo midwest 12 100.0 47.7458136539 24.4310863031 12.2799484757 0.0 0.0 0.0
West Virginia midwest 5 100.0 52.2541863461 23.7440961786 11.8076427651 0.0 0.0 0.0
Valparaiso midwest 13 100.0 40.5753542293 20.6097037355 9.9613568055 0.0 0.0 0.0
Cincinnati midwest 8 100.0 44.2679261486 20.5238299699 9.9613568055 0.0 0.0 0.0
Hampton midwest 16 100.0 38.256762559 12.3658222413 4.46543580936 0.0 0.0 0.0
Texas midwest 11 100.0 51.5671962216 30.9145556033 0.0 0.0 0.0 0.0
Notre Dame midwest 3 100.0 58.4800343495 30.2275654787 0.0 0.0 0.0 0.0
Butler midwest 6 100.0 48.4328037784 23.0141691713 0.0 0.0 0.0 0.0
Northeastern midwest 14 100.0 41.5199656505 15.8437097467 0.0 0.0 0.0 0.0
Wichita St. midwest 7 100.0 55.2168312581 0.0 0.0 0.0 0.0 0.0
Indiana midwest 10 100.0 44.7831687419 0.0 0.0 0.0 0.0 0.0
New Mexico St. midwest 15 100.0 0.0 0.0 0.0 0.0 0.0 0.0
In [21]:
sr.most_common_bracket
Out[21]:
Kentucky (1)                                                                   
Hampton (16)              Ham (16)                                             
Cincinnati (8)                                                                 
Purdue (9)                Pur (9)  Pur (9)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Val (13) Buf (12) Pur (9)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Ind (10)                                             
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            St. (9)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Ste (12) Ste (12)                           
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  SMU (6)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Ste (12) Kan (2)          
Wisconsin (1)             Wis (1)  Okl (9)  Wof (12) Wof (12) Okl (3)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Wof (12) Wof (12)                                    
Wofford (12)                                                                   
North Carolina (4)        Har (13)                                             
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Laf (16) Nor (8)  Nor (8)  Okl (3)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Wyo (12) Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Okl (3)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -35.9796902956
In [22]:
sr.most_common_bracket_count
Out[22]:
4

Huh. Well, that's not exactly my pick for best bracket, but there you have it.

Excersizes for the reader:

  • make a better energy function, using maybe a weighted average of different rankings. I slurped in KenPom and Jeff Sagarin, but you could add your own.
  • come up with a better "hometown wins" version. E.g. explicitly check for KU, and tweak the rankings.
In [46]:
from MarchMadnessMonteCarlo import RankingsAndStrength as RAS
strength = RAS.kenpom['Pyth']
jsstrength = RAS.sagarin['Rating']

def weighted_KU_energy_game(winner, loser):
    if winner == 'Kansas':
        win_pct = 0.99
    elif loser == 'Kansas':
        win_pct = 0.01
    else:
        A,B = strength[winner],strength[loser]
        # see http://207.56.97.150/articles/playoff2002.htm
        kenpom = (A-A*B)/(A+B-2*A*B)
        A,B = jsstrength[winner]/100, jsstrength[loser]/100
        A,B = min(A,0.9999),min(B,0.9999)
        sagarin = (A-A*B)/(A+B-2*A*B)
        win_pct = 0.70*kenpom + 0.30*sagarin
    return -win_pct
In [48]:
MMMC.set_energy_function(weighted_KU_energy_game)
In [55]:
results = MMMC.runbracket1(ntrials=100000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Duk (1)  Kan (2)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -45.7476455682

Most common bracket (13)
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Pur (9)  Pur (9)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Buf (12) Buf (12)                           
Butler (6)                                                                     
Texas (11)                But (6)                                              
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  San (8)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Ste (12) San (8)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  UAB (14) UAB (14)                                    
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Kan (2)          
Wisconsin (1)             Coa (16) Okl (9)  Nor (4)  Ari (2)  Ari (2)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Wof (12) Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Mis (11) Mis (11) Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   VCU (7)  Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Nor (8)  Nor (8)  Nor (8)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Mic (7)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Bel (15)                                             
Belmont (15)                                                                   
Total bracket energy: -38.9077921757

In [56]:
MMMC.showstats(results['all'],description='full run',newfig=True)
In [57]:
h = HTML(MMMC.maketable(results))
h
Out[57]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kansas midwest 2 100.0 71.415 55.343 44.429 35.557 28.44 22.414
Kentucky midwest 1 100.0 69.107 50.644 37.446 18.891 13.788 9.643
Arizona west 2 100.0 65.022 43.251 29.626 19.515 12.801 6.535
Wisconsin west 1 100.0 64.476 43.066 29.09 17.775 11.201 5.467
Virginia east 2 100.0 65.076 42.901 28.131 17.706 10.58 5.127
Villanova east 1 100.0 63.795 42.213 27.546 17.299 10.203 4.848
Duke south 1 100.0 63.066 40.645 25.96 15.924 6.731 3.64
Gonzaga south 2 100.0 62.557 37.869 24.019 14.332 6.157 3.354
Utah south 5 100.0 54.831 33.958 18.603 10.684 4.171 2.105
North Carolina west 4 100.0 55.798 35.168 18.059 9.103 4.648 1.857
Oklahoma east 3 100.0 57.558 32.89 16.602 8.699 4.476 1.856
Iowa St. south 3 100.0 58.435 32.387 17.002 9.079 3.519 1.74
Baylor west 3 100.0 57.425 33.143 16.356 8.703 4.328 1.714
Northern Iowa east 5 100.0 58.362 33.246 16.966 8.687 4.148 1.653
Notre Dame midwest 3 100.0 58.973 33.146 12.23 5.888 3.064 1.459
Louisville east 4 100.0 56.572 30.948 15.201 7.349 3.329 1.289
Wichita St. midwest 7 100.0 55.022 19.553 10.731 5.057 2.614 1.235
Michigan St. east 7 100.0 53.365 25.012 13.642 6.637 3.041 1.157
Ohio St. west 10 100.0 50.97 23.373 12.8 6.241 2.876 1.053
Georgetown south 4 100.0 56.351 27.242 13.727 6.735 2.386 1.046
SMU south 6 100.0 50.545 26.664 12.96 6.318 2.197 0.98
Iowa south 7 100.0 53.959 26.803 13.763 6.446 2.273 0.979
Xavier west 6 100.0 54.002 25.958 11.422 5.532 2.541 0.936
West Virginia midwest 5 100.0 55.439 30.487 13.026 4.733 2.187 0.896
Providence east 6 100.0 54.017 27.975 12.282 5.736 2.417 0.88
Arkansas west 5 100.0 55.043 25.753 11.997 5.229 2.275 0.845
Texas midwest 11 100.0 51.713 25.828 9.107 4.172 1.938 0.84
San Diego St. south 8 100.0 49.116 22.93 11.033 5.181 1.853 0.755
VCU west 7 100.0 49.03 21.187 9.927 4.523 1.995 0.744
Maryland midwest 4 100.0 53.53 26.132 10.837 4.056 1.82 0.735
Butler midwest 6 100.0 48.287 23.838 8.227 3.593 1.688 0.694
Davidson south 10 100.0 46.041 21.646 10.642 4.871 1.683 0.678
Oklahoma St. west 9 100.0 50.857 22.453 11.343 4.682 1.87 0.671
North Carolina St. east 8 100.0 51.914 22.439 11.07 4.813 1.896 0.611
St. John's south 9 100.0 50.884 23.054 10.411 4.442 1.496 0.575
Stephen F. Austin south 12 100.0 45.169 22.367 9.957 4.368 1.439 0.568
Georgia east 10 100.0 46.635 19.336 9.623 4.499 1.597 0.567
Oregon west 8 100.0 49.143 20.518 9.907 4.144 1.678 0.559
UCLA south 11 100.0 49.455 24.063 10.453 4.388 1.575 0.554
LSU east 9 100.0 48.086 20.5 9.842 4.191 1.708 0.551
Mississippi west 11 100.0 45.998 21.001 8.299 3.623 1.44 0.541
Cincinnati midwest 8 100.0 52.308 20.082 10.116 3.18 1.382 0.541
Dayton east 11 100.0 45.983 23.105 9.306 3.898 1.639 0.495
Purdue midwest 9 100.0 47.692 18.407 8.311 2.991 1.216 0.453
Buffalo midwest 12 100.0 44.561 22.616 8.681 2.764 1.099 0.392
Indiana midwest 10 100.0 44.978 13.846 5.745 2.269 0.928 0.344
Harvard west 13 100.0 44.202 21.121 7.51 2.602 0.998 0.339
Georgia St. west 14 100.0 42.575 19.898 7.477 2.894 1.147 0.321
Valparaiso midwest 13 100.0 46.47 20.765 7.922 2.461 0.91 0.286
Wyoming east 12 100.0 41.638 18.284 7.123 2.652 0.927 0.268
UC Irvine east 13 100.0 43.428 17.522 6.975 2.726 0.965 0.263
Wofford west 12 100.0 44.957 17.958 6.914 2.297 0.81 0.241
UAB south 14 100.0 41.565 16.886 6.077 2.033 0.673 0.222
Northeastern midwest 14 100.0 41.027 17.188 5.238 1.867 0.626 0.182
Eastern Washington south 13 100.0 43.649 16.433 5.605 1.968 0.527 0.168
New Mexico St. midwest 15 100.0 28.585 11.258 4.293 1.566 0.487 0.165
Albany east 14 100.0 42.442 16.03 5.183 1.709 0.542 0.151
North Dakota St. south 15 100.0 37.443 13.682 5.084 1.791 0.499 0.145
Lafayette east 16 100.0 36.205 14.848 5.277 1.578 0.462 0.142
Belmont east 15 100.0 34.924 12.751 5.231 1.821 0.531 0.139
Coastal Carolina west 16 100.0 35.524 13.963 5.18 1.746 0.538 0.138
Robert Morris south 16 100.0 36.934 13.371 4.704 1.44 0.372 0.1
Texas Southern west 15 100.0 34.978 12.189 4.093 1.391 0.393 0.085
Hampton midwest 16 100.0 30.893 10.867 3.661 0.955 0.262 0.069

Looks good! I'm in!

Evolution of a Learning Goal

I just got back from the Lilly Conference on College Teaching. The first workshop I went to was on course design. One chunk of this was on learning goals. This came in the second half of the workshop, after we’d talked quite a bit about learning factors, etc. I chose to work on my 200-level Biophysics class, where I thought I had decent goals already. One of the ones I particularly liked was

Use simple physical models to provide quantitative insight into biological systems.

The people running the workshop have some good reasons for insisting on particular phrasing, and they recommend either “students will be able to [verb] [object]” or “given [thing] students will be able to [other thing].” Fair enough, if I lighten up a bit on the need for an action verb, I get

Students will be able to use simple physical models to provide quantitative insight into biological systems.

We talked a bit about how many goals a class should have. Their book recommends 3-5. If you have a ton more than that, you’re listing specific objectives rather than overarching goals. One of the authors/instructors actually recommends a single goal. If you can do it, it gives you a really nice way to focus the class, makes you comfortable cutting material in the interest of deep(er) learning, etc. The danger, of course, is that you don’t want a goal that’s general enough as to be meaningless.

We also talked about the time-frame of a goal. If you’re setting a 5-year goal (i.e. what do you want them to be like in 5 years), they have 5 years between now and the goal, so you don’t get to take credit (or responsibility) for the whole goal; you just need to plant a seed.

In thinking more about the class and about attitudes, it became clear that I really wanted the students to be able to identify a problem and play with it.

Students will be able to identify a biological problem and play with it, using simple physical models to provide quantitative insight.

At this point, we talked a bit more about the purpose of goals, and the idea that you want the class to be committed to the goals, and I was introduced to what may be my new favorite fable/punchline: “in a ham and egg breakfast, the chickin is involved; the pig is committed.”

Anyway, the above goal is sounding wordy, and is really trying to do too much at once. Worse, “play with it” is maybe clear to the instructor, but really unlikely to be clear to the students. There are at least two things going on here, and they’re worth separating: I’ll have to teach them as separate skills, students will certainly learn them as separate skills, and I should be explicit.

  • Given an identified biological problem, students will be able to identify several categories of physical models with which the problem might be addressed.
  • Given a biological problem and a physical model, students will be able to provide quantitative insight into the system.

At this point, it’s close, but there’s a bit more work to be done with categorizing the physical model(s). I mean, the students aren’t going to be dropping in an extraordinarily complex set of models for a 200-level class that requires only a semester of calc-based mechanics as a pre-req:

  • Given a biological system and an appropriate simple physical model, students will be able to adapt the model to the system, providing quantitative insight.
  • Given a biological problem, students will be able to identify several [candidate? potential?] physical models to apply to the system.

The idea of “adapting the model to the system” was an important addition, and a clarification of “play with.” It’s also something that I expect most students to have relatively little experience with, especially in an interdisciplinary context! In addition to a few other things, this was where I started monkeying with the “quantitative insight” wording, and where to put it in the goal.

  • Given a biological system, students will be able to name the appropriate physical models that apply.
  • Given a biological system, take a physical model and appropriately adapt it through quantitative analysis to that specific system.

tweak a bit

  • Given a biological system and a physical model, students will be able to appropriately adapt the model to the system to provide quantitative [insight?] analysis.
  • Given a biological system, students will be able to name appropriate physical models.

That’s pretty close. My penultimate version was a combination of the last few.

  • Given a biological system and a physical model, students will be able to adapt the model to the system through quantitative analysis.
  • Given a biological system, students will be able to name relevant physical models.

“… be able to name …” doesn’t sound so strong, even though “relevant” carries a lot of weight. After looking at Bloom’s Taxonomy of Educational Objectives, a simple switch from “name” to “predict” captures what I want, and makes the goal much higher-level. I’ve added the third (and final) goal here:

  • Given a biological system and a physical model, students will be able to adapt the model to the system through quantitative analysis.
  • Given a biological system, students will be able to predict relevant physical models.
  • Students will gain exposure to important questions in the modern field of molecular biophysics, and evaluate current research on a system of their choice.

That’s much better, both for me and the students! And that’s despite the fact that my original goal (“use simple physical models to provide quantitative insight into biological systems”) still sounds decent. I wonder if I should use that phrasing, and perhaps something about “play with a model” in the course description rather than the goals.

Lookup Tables

[twitter-follow username="mglerner" scheme="dark"]

This whole post is available as an IPython Notebook here.

Lookup Tables with Lagrangian Interpolation

One of my students wanted to speed up the calculation of exp(x) in a simulation. There are a few ways to do this, but a lookup table is often a huge win in situations like this. The basic idea is that, for an expensive function like exp(x), you pre-calculate exp(x) for a bunch of values that cover the range in which you're interested. You then look things up in the table at runtime. If the exact value you want isn't in the table, you use a cheap interpolation function. By tweaking the density of your pre-calculated values and the sophistication of your interpolation function, you can get results that are quite close to exact calculations for a fraction of the run-time cost.

Sadly for me, I didn't know a bunch about which interpolation functions to use, so I asked Andy Simmonett. I wrote the Python bits below, but the general explanation is direct from him, with some light modifications. He's a QM/MM guy, so some of what is written below should be taken in the context of molecular simulations.

Before anything else, let's set up Python and use Seaborn for good-looking default plotting parameters.

In [1]:
from __future__ import division
import sys, os
import numpy as np, scipy as sp, pandas as pd, matplotlib as mpl
from scipy import stats
import matplotlib.pyplot as plt
import seaborn
seaborn.set()

%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark
%watermark -v -m -p numpy,pandas,scipy,matplotlib
%matplotlib inline
Installed watermark.py. To use it, type:
  %load_ext watermark
CPython 2.7.8
IPython 2.3.0

numpy 1.9.1
pandas 0.14.1
scipy 0.14.0
matplotlib 1.4.2

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.16.2-200.fc20.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

The General strategy

I [Andy] don’t know of any decent source about lookup tables, but here are some notes that demonstrate how we figured out the splines. The simplest approach to understand is Lagrangian Interpolation wiki wolfram, which is the approach that we used because it’s so general. The strategy is as follows:-

  1. Choose a range of inputs values (∆E/kT in your case) that you expect to encounter frequently (say, 0 to 3). I’ve defined the function below such that an input of the positive argument ∆E/kT returns Exp[-∆E/kT].

  2. Choose a spacing value (referred to below as del) between successive grid points; this will determine both the storage needed for the table and also the accuracy of the interpolation, so some experimentation is necessary. The example below uses del=0.1, but generally you need at least a hundred grid points per unit to get single precision accuracy, and even more for double precision.

  3. You need to allocate (range/del+1) grid points to hold the table of values. For a quartic interpolation (see below) you need two extra points, to handle the end points (-0.1 and 3.1 in the example above, so you can interpolate the full range).

  4. Now you need to construct your table: Tab = {Exp[0.1], Exp[0], Exp[-0.1], Exp[-0.2], …, Exp[-3.1]}

  5. For a given input, x, if it’s outside the range of your interpolation table, just explicitly compute and return Exp[-x].

  6. If it’s inside the range, use an interpolating polynomial (below) to interpolate the values.

In [2]:
from numpy import exp,sqrt,sin
def gettable(start,stop,d,f):
    return f(np.arange(start-2*d,stop+2*d,d)) # enough for quintic

Wikipedia has some good info on Lagrangian interpolation (see links above). I’ve [Andy] pasted the explicit code needed for cubic, quartic, and quintic splines; these were obtained using the corresponding Mathematica inputs. Most codes use cubic splines for efficiency, but we found that quartic splines let you use a coarser table, so they may be more cache effiecient.

  • cubic

     Simplify[InterpolatingPolynomial[{{x0 - del, e0}, {x0, e1}, {x0 + del, e2}}, x]]
    
     (2 del^2 e1 - del (e0 - e2) (x - x0) + (e0 - 2 e1 + e2) (x - x0)^2)/(2 del^2)
    
  • quartic

     Simplify[InterpolatingPolynomial[{{x0 - del, e0}, {x0, e1}, {x0 + del, e2}, {x0 + 2 del, e3}}, x]]
    
     -((-6 del^3 e1 + del^2 (2 e0 + 3 e1 - 6 e2 + e3) (x - x0) - 3 del (e0 - 2 e1 + e2) (x - x0)^2 + (e0 - 3 e1 + 3 e2 - e3) (x - x0)^3)/(6 del^3))
    
  • quintic

     Simplify[InterpolatingPolynomial[{{x0 - 2 del, e0}, {x0 - del, e1}, {x0, e2}, {x0 + del, e3}, {x0 + 2 del, e4}}, x]]
    
     (1/(24 del^4))(24 del^4 e2 + 2 del^3 (e0 - 8 e1 + 8 e3 - e4) (x - x0) - 
      del^2 (e0 - 16 e1 + 30 e2 - 16 e3 + e4) (x - x0)^2 - 
      2 del (e0 - 2 e1 + 2 e3 - e4) (x - x0)^3 + (e0 - 4 e1 + 6 e2 - 4 e3 + 
      e4) (x - x0)^4)
    

Assume we’re using a quartic polynomial, and want to compute Exp[-0.22]. We need to pick out the 4 consecutive table entries {e0,e1,e2,e3} whose x values bound 0.22, so that we’re interpolating, not extrapolating. This is shown below (# represents the value we’re interested in).

  e0    e1     e2     e3
  |      | #    |      |
  ----------------------
x0-del   x0  x0+del x0+2del
 0.1    0.2    0.3    0.4

The values {e0,e1,e2,e3} are {Exp[-0.1],Exp[-0.2],Exp[-0.3],Exp[-0.4]}, taken straight from the table. The we just plug in (x-x0 = 0.22-0.2 = 0.02) into the quartic formula above, along with the tablulated e* values, and out comes the answer.

For an odd-order interpolating spline, make sure the value you're probing is in the middle of the range of table values, although it doesn’t matter which side of the center point if falls on; make sure the corresponding sign is correct when computing x-x0 though.

This is using Lagrangian interpolation. PME uses and Euler interpolating spline instead, so that might be worth investigating. I’ve attached the paper in case you’re interested. Note that the method above works for absolutely any function; you just have to tabulate the compute values ahead of time, which is cheap. To test the table, you can simply probe all of the midpoints (0.05, 0.15, 0.25, … in the example above) and compare the interpolated and analytic function. If it’s something that’s commonly run in single precision anyway, you should be able to get away with errors around 10^-7 or 10^-8. We get 10^-13, in our code, which is close enough to the machine epsilon for our liking.

Note I also stuck in linear interpolation for comparison.

In [3]:
def cubic(x,f,table,start,stop,d):
    try:
        #i0 will be the table index of the largest element lower than x.
        i0 = int((x - start)//d) + 2 # because we have two extra entries
        x0 = start + d*(i0 -2)
        e0, e1, e2 = table[i0-1:i0+2]
        #print("{i0} {x0} {e0} {e1} {e2} {e3}".format(**locals()))
        return (2*e1*d**2 - d*(e0 - e2)*(x - x0) + (e0 - 2*e1 + e2)*(x - x0)**2)/(2*d**2)
    # You'd think IndexError, but it comes from grabbing N table entries.
    # If the start or stop value is too high, you just won't be able to extract
    # The full three (in this case) entries from the table error, so the
    # e0, e1, e2 = ... line will raise a ValueError
    except ValueError: 
        return f(x)
def quartic(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1, e2, e3 = table[i0-1:i0+3]
        return -((-6*e1*d**3 + d**2 * (2*e0 + 3*e1 - 6*e2 + e3)*(x - x0) 
                  - 3*d*(e0 - 2*e1 + e2)*(x - x0)**2 
                  + (e0 - 3*e1 + 3*e2 - e3)*(x - x0)**3)/(6*d**3))
    except ValueError:
        return f(x)
def quintic(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1, e2, e3, e4 = table[i0-2:i0+3]
        return (1/(24*d**4))*( 24*d**4*e2 + 2*d**3*(e0 - 8*e1 + 8*e3 - e4)*(x - x0) - 
                d**2*(e0 - 16*e1 + 30*e2 - 16*e3 + e4)*(x - x0)**2 - 
                2*d*(e0 - 2*e1 + 2*e3 - e4)*(x - x0)**3 + 
                (e0 - 4*e1 + 6*e2 - 4*e3 + e4)*(x - x0)**4 )
    except ValueError:
        return f(x)
# and why not
def linear(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1,  = table[i0:i0+2]
        #m = (e1-e0)/d
        #b = e0 - m*x0
        #return m*x + b
        return e0 + (x-x0)*(e1-e0)/d
    except ValueError:
        return f(x)
    

def getinterp(order):
    return {2:linear,3:cubic,4:quartic,5:quintic}[order]
def getinterpname(order):
    return {2:'linear',3:'cubic',4:'quartic',5:'quintic'}[order]

So let's test, giving each one values inside the range as well as on both sides

In [4]:
start,stop,d = 0,3,0.1
def f(x):
    return exp(-x)
table = gettable(start,stop,d,f)
def printapprox(order,val,f):
    interp = getinterp(order)
    exact,interpd = f(val),interp(val,f,table,start,stop,d)
    pdiff = abs(100 - 100*interpd/exact)
    print("exact {a} approx {b} %diff {c}".format(a=exact,b=interpd,c=pdiff))
def getf(name):
    def fexp(x):
        return exp(-x)
    def fsqr(x):
        return sqrt(x)
    def fsin(x):
        return sin(4*x)
    return {'exp':fexp,'sqrt':sqrt,'sin':fsin}[name]

print("Linear")
printapprox(2,.22,f)
printapprox(2,-.22,f)
printapprox(2,5.22,f)
print("Cubic")
printapprox(3,.22,f)
printapprox(3,-.22,f)
printapprox(3,5.22,f)
print("Quartic")
printapprox(4,.22,f)
printapprox(4,-.22,f)
printapprox(4,5.22,f)
print("Quintic")
printapprox(5,.22,f)
printapprox(5,-.22,f)
printapprox(5,5.22,f)
Linear
exact 0.802518797962 approx 0.803148246599 %diff 0.0784341298732
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Cubic
exact 0.802518797962 approx 0.802492715994 %diff 0.00325001342684
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Quartic
exact 0.802518797962 approx 0.802517668788 %diff 0.000140703834049
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Quintic
exact 0.802518797962 approx 0.802518849726 %diff 6.45019096623e-06
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0

That looks good. It's worth noting that this algorithm gives bad results in at least one clear case: for sqrt, the interpolations that try to use values for x < 0 will all give nan. One easy fix would be to test for a nan return value and just return the exact value. You'd have to consider how close your actual values are likely to be to zero, though, as that test will have to run for every lookup. There are probably smarter fixes. Here's a demonstration:

In [5]:
table = gettable(start=0,stop=3,d=0.1,f=getf('sqrt'))
cubic(x=0.01,f=getf('sqrt'),table=table,start=0,stop=3,d=0.1)
Out[5]:
nan

Now let's look at this visually.

Note that I cap the lower axis at 150%. When you're calculating numbers very close to zero (see sin for clear examples), the percentage error can get huge, and you might want to use other techniques (Taylor expansion, etc.).

In [6]:
from IPython.html import widgets 
from IPython.html.widgets import interact

figsize=(10,10)
def showinterp(d,order,fname='exp',rangeexp=0):
    fig = plt.figure(figsize=figsize)
    f = getf(fname)
    start,stop = 0,3
    start,stop,d = start*10**rangeexp,stop*10**rangeexp,d*10**rangeexp
    x = np.arange(start,stop,d/100.0)
    table = gettable(start,stop,d,f)
    exact = f(x)
    # Make these ufuncs for some huge speedup
    interp = getinterp(order)
    interpd = np.array([interp(i,f,table,start,stop,d) for i in x])
    #print(interpd[:10])
    err = interpd - exact
    errfrac = err/exact
    xerrfrac = x[~np.isnan(errfrac)] # for plotting later
    errfrac = errfrac[~np.isnan(errfrac)]
    errfrac = np.abs(errfrac)
    perr = np.abs(100*errfrac)
    plt.subplot(3,1,1)
    plt.plot(x,exact,'k',label='exact',lw=3)
    plt.plot(x,interpd,'r-',label='{n} interp d {d}'.format(n=getinterpname(order),d=d))
    plt.legend(fancybox=True)
    plt.subplot(3,1,2)
    plt.plot(x,err,'k-',label='error')
    plt.legend(fancybox=True)
    plt.subplot(3,1,3)
    plt.plot(xerrfrac,perr,'k-',label='% error')
    if plt.ylim()[1] > 151:
        plt.ylim(plt.ylim()[0],151)
    plt.legend(fancybox=True)
    #plt.grid(True)
    m = np.max(err[~np.isnan(err)])
    a = np.average(err[~np.isnan(err)])
    mp = np.max(perr)
    ap = np.average(perr)
    plt.xlabel('{order} {d} {start} {stop} Max err {m:g}[{mp:g}%] avg {a:g}[{ap:g}%]'.format(**locals()))
    return fig
interact(showinterp,d=widgets.FloatSliderWidget(min=.001,max=.5,step=.001,value=.1),
         order=(2,5),
         fname=['exp','sqrt','sin'],
         rangeexp=widgets.IntSliderWidget(min=-20,max=0,step=20,value=0)
         )

That's awesome in a live notebook. Let's do something similar that we can look at statically. There are two main choices here: JSAnimation and IPy-Widgets, both by the indomitable Jake VanderPlas. I really like the ability to play things as a movie, but ipywidgets wins for now because I can do dropdowns. You can animate it by draggin the sliders. Hey, it's like a flipbook! Rangeexp will let you play with the range, so you can see what happens when you look over small divisions between floating point numbers.

In [7]:
from ipywidgets import StaticInteract, RangeWidget, RadioWidget, DropDownWidget
figsize=(6,6)

StaticInteract(showinterp,d=RadioWidget([0.001, 0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5],default=0.1),
               order=RangeWidget(2,5,1,default=3),
               fname=DropDownWidget(['exp','sqrt','sin'],default='exp'),
               rangeexp=RadioWidget([-20,0],default=0),
               )
/home/mglerner/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Out[7]:
d: 0.001: 0.0025: 0.005: 0.0075: 0.01: 0.025: 0.05: 0.075: 0.1: 0.25: 0.5:
fname:
order:
rangeexp: -20: 0:

Speed Testing

The obvious next thing to do here is to compare the speed of explicit calculations vs. table lookups. I think Python is likely a bad language for that; if you're doing this, you'll likely be doing it in C or FORTRAN, and you should just test the code explicitly there. Speed numbers from Python will almost surely be contaminated by Python-specific overhead. If you're using Python for time-critical code, you should probably check out The IPython Cookbook or Python High Performance Programming.

In [1]:
%%html
    <div id="disqus_thread"></div>
    <script type="text/javascript">
        /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
        var disqus_shortname = 'mathematicalphysics'; // required: replace example with your forum shortname

        /* * * DON'T EDIT BELOW THIS LINE * * */
        (function() {
            var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
            (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
(do the above for comments via the nbviewer)
In []:
 

Making IPython Notebooks for the matplotlib examples

[twitter-follow username="mglerner" scheme="dark"]

matplotlib comes with tons of fantastic examples. I’m not as familiar with matplotlib as I probably should be, so I often find myself wanting to tinker a bit, but needing to refer to those examples. Since matplotlib comes with such wonderful documentation, I though it would be great to just turn those docs into IPython Notebooks for easy tinkering. That’s probably biting off a bit more than I want to chew at the moment, considering that the matplotlib docs are fairly involved and written in reStructuredText instead of markdown (what the IPython Notebook uses).

Luckily, the IPython Notebook format is so mind-bendingly sane that I didn’t even need to read any documentation to understand it. So, instead, I wrote a bit of code that gobbles up matplotlib example scripts and spits out IPython Notebooks. The whole notebook is JSON, but I only want simple things, so I hardcode everything except for the cells. (After Daniel’s comment below, I started to write my own JSONEncoder. Then, I realized that I was right about the “it’s all JSON” thing and rewrote the notebook class). I have a little IPyNB class that knows how to add cells to itself and spit out the results as strings and files:

import os, json, glob
        
class IPyNB():
    def __init__(self,name):
        self.name = name
        self.cells = []
        self.d = {‘metadata’:{‘name’:”},
                  ‘nbformat’:3,
                  ‘nbformat_minor’:0,
                  ‘worksheets’:[{'cells':[],’metadata’:{}}]
                  }
        
    def addcell(self,celltype,content):
        cell = {‘cell_type’:celltype,
                  ‘metadata’:{}
                  }
        if celltype == ‘code’:
            cell['collapsed'] = False
            cell['input'] = content.split(‘\n’)
            cell['language'] = ‘python’
            cell['outputs'] = []
        elif celltype == ‘markdown’:
            cell['source'] = content.split(‘\n’)
        #elif self.celltype = ‘raw’:
        else:
            raise NotImplemented(‘Unknown cell type: {celltype}’.format(celltype=self.celltype))
        self.d['worksheets'][0]['cells'].append(cell)

    def tostring(self):
        return json.dumps(self.d,
                          sort_keys=True,
                          indent=1, separators=(‘,’, ‘: ‘),
            )

    def tofile(self,outdir=’.',overwrite=False):
        fname = os.path.join(outdir,self.name+’.ipynb’)
        if os.path.isfile(fname) and not overwrite:
            raise IOError(“File {fname} already exists”.format(fname=fname))
        f = open(fname,’w')
        f.write(self.tostring())
        f.close()

I’ve defined a few celltypes. It’s easy to add more if there are more.

And then a couple of functions to read in the matplotlib examples and spit out notebooks:

def make_mpl_examples(subdir=’images_contours_and_fields’,basedir=’~/coding/matplotlib/examples’,outdir=’.',overwrite=False):
    n = IPyNB(subdir + ‘ Examples’)
    n.addcell(‘markdown’,”"”#matplotlib examples
The below examples are taken directly from the matplotlib example directory {subdir} and rendered in an IPython Notebook. Before you run them, you should execute one of the first two cells, depending on whether you want inline rendering or not. The inline rendering is usually much nicer for interactive work, but doesn’t support all features (e.g. animation).”"”.format(subdir=subdir)
    n.addcell(‘code’,'%matplotlib inline’)
    n.addcell(‘code’,'%matplotlib’)
    pat = os.path.join(os.path.expanduser(basedir),subdir,’*.py’)
    examples = glob.glob(pat)
    if not examples:
        raise IOError(‘No files found: {pat}’.format(pat=pat))
    for ex in examples:
        n.addcell(‘markdown’,'## {exname}’.format(exname=ex))
        n.addcell(‘code’,open(ex).read())
    n.tofile(overwrite=overwrite,outdir=outdir)

def make_all_mpl_examples(outdir=’output’,basedir=’~/coding/matplotlib/examples’,overwrite=False):
    # There must be a builtin!
    bd = os.path.expanduser(basedir)
    subdirs = [d for d in os.listdir(bd) if os.path.isdir(os.path.join(bd,d))]
    for s in subdirs:
        print “Doing”,s
        make_mpl_examples(subdir=s,basedir=basedir,outdir=outdir,overwrite=overwrite)

It spits out one notebook per example directory, but is easy to change. It does what I want very nicely: gives me an immediate way to tweak the matplotlib examples. I’ll leave the bigger project (turning all of the matplotlib docs into IPython Notebooks) for later, if ever. The first thing I’d need to figure out is how to execute the code in the cells programatically. I’m sure there’s an API.

Here’s an example of some examples. I’ve executed several of the cells so that there’s output. This is significantly less interesting on the web, and more interesting if you run it in a notebook on your own so that you can tweak things.

matplotlib examples

The below examples are taken directly from the matplotlib example directory images_contours_and_fields and rendered in an IPython Notebook. Before you run them, you should execute one of the first two cells, depending on whether you want inline rendering or not. The inline rendering is usually much nicer for interactive work, but doesn't support all features (e.g. animation).

In [1]:
%matplotlib inline
In []:
%matplotlib

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/image_demo.py

In [2]:
"""
Simple demo of the imshow function.
"""
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

image_file = cbook.get_sample_data('ada.png')
image = plt.imread(image_file)

plt.imshow(image)
plt.axis('off') # clear x- and y-axes
plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/image_demo_clip_path.py

In [3]:
"""
Demo of image that's been clipped by a circular patch.
"""
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.cbook as cbook


image_file = cbook.get_sample_data('grace_hopper.png')
image = plt.imread(image_file)

fig, ax = plt.subplots()
im = ax.imshow(image)
patch = patches.Circle((260, 200), radius=200, transform=ax.transData)
im.set_clip_path(patch)

plt.axis('off')
plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/pcolormesh_levels.py

In [4]:
"""
Shows how to combine Normalization and Colormap instances to draw
"levels" in pcolor, pcolormesh and imshow type plots in a similar
way to the levels keyword argument to contour/contourf.

"""

import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
from matplotlib.ticker import MaxNLocator
import numpy as np


# make these smaller to increase the resolution
dx, dy = 0.05, 0.05

# generate 2 2d grids for the x & y bounds
y, x = np.mgrid[slice(1, 5 + dy, dy),
                slice(1, 5 + dx, dx)]

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

# x and y are bounds, so z should be the value *inside* those bounds.
# Therefore, remove the last value from the z array.
z = z[:-1, :-1]
levels = MaxNLocator(nbins=15).tick_values(z.min(), z.max())


# pick the desired colormap, sensible levels, and define a normalization
# instance which takes data values and translates those into levels.
cmap = plt.get_cmap('PiYG')
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)

plt.subplot(2, 1, 1)
im = plt.pcolormesh(x, y, z, cmap=cmap, norm=norm)
plt.colorbar()
# set the limits of the plot to the limits of the data
plt.axis([x.min(), x.max(), y.min(), y.max()])
plt.title('pcolormesh with levels')



plt.subplot(2, 1, 2)
# contours are *point* based plots, so convert our bound into point
# centers
plt.contourf(x[:-1, :-1] + dx / 2.,
             y[:-1, :-1] + dy / 2., z, levels=levels,
             cmap=cmap)
plt.colorbar()
plt.title('contourf with levels')


plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/streamplot_demo_features.py

In [5]:
"""
Demo of the `streamplot` function.

A streamplot, or streamline plot, is used to display 2D vector fields. This
example shows a few features of the stream plot function:

    * Varying the color along a streamline.
    * Varying the density of streamlines.
    * Varying the line width along a stream line.
"""
import numpy as np
import matplotlib.pyplot as plt

Y, X = np.mgrid[-3:3:100j, -3:3:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)

plt.streamplot(X, Y, U, V, color=U, linewidth=2, cmap=plt.cm.autumn)
plt.colorbar()

f, (ax1, ax2) = plt.subplots(ncols=2)
ax1.streamplot(X, Y, U, V, density=[0.5, 1])

lw = 5*speed/speed.max()
ax2.streamplot(X, Y, U, V, density=0.6, color='k', linewidth=lw)

plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/streamplot_demo_masking.py

In [6]:
"""
Demo of the streamplot function with masking.

This example shows how streamlines created by the streamplot function skips
masked regions and NaN values.
"""
import numpy as np
import matplotlib.pyplot as plt

w = 3
Y, X = np.mgrid[-w:w:100j, -w:w:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)

mask = np.zeros(U.shape, dtype=bool)
mask[40:60, 40:60] = 1
U = np.ma.array(U, mask=mask)
U[:20, :20] = np.nan

plt.streamplot(X, Y, U, V, color='r')

plt.imshow(~mask, extent=(-w, w, -w, w), alpha=0.5,
           interpolation='nearest', cmap=plt.cm.gray)

plt.show()
In []:

Comments from old blog

2 Responses to Making IPython Notebooks for the matplotlib examples

  1. Daniel says:

    Your IPyNB class could be made a lot more readable by using a custom JSON encoder, e.g. extending JSONEncoder http://docs.python.org/2/library/json.html#json.JSONEncoder. That way, you could store the preamble/afterward as a Python dict and just call json.dumps(..., default=MyJSONEncoder) inside tostring().

Histograms and Kernel Density Estimation (KDE)

[twitter-follow username="mglerner" scheme="dark"]

(Note: you can download the IPython Notebook here.)

Why histograms

As we all know, Histograms are an extremely common way to make sense of discrete data. Whether we mean to or not, when we're using histograms, we're usually doing some form of density estimation. That is, although we only have a few discrete data points, we'd really pretend that we have some sort of continuous distribution, and we'd really like to know what that distribution is. For instance, I was recently grading an exam and trying to figure out what the underlying distribution of grades looked like, whether I should curve the exam, and, if so, how I should curve it.

I'll poke at this in an IPython Notebook; if you're doing this in a different environments, you may wish to uncomment out the commented lines so that your namespace is properly polluted.

In [1]:
from __future__ import division
%pylab inline

grades = array((93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61.4,96,90))
junk = hist(grades)
Populating the interactive namespace from numpy and matplotlib

Why not histograms?

We can play around with the number of bins, but it's not totally clear what's going on with the left half of the grades.

In [2]:
junk = hist(grades,5)
In [3]:
junk = hist(grades,15)

So, maybe the histogram isn't the perfect tool for the job at hand. In fact, there are quite a few well-known problems with histograms. Shodor has a really nice histogram activity that lets you play around with data interactively. Rather than using Java or JavaScript directly, Jake Vanderplas has a great package called JSAnimation that lets us animate things directly in IPython Notebooks. I'll cheat a bit: since all I really need for this is a single slider, I can use JSAnimation to let us interact with data very similarly to the Shodor pages.

In [4]:
from JSAnimation.IPython_display import display_animation, anim_to_html

Before we start, I'll load in a few data sets. If you're interested, you can rerun this notebook with a different data set to see how it affects things. data_shodor is the "My Data" set from their histogram activity page, data_sat is the average SAT Math data from the same page, data_tarn is from Tarn Duong's fantastic KDE explanation (we'll get there), and simple_data is just a very simple data set.

In [5]:
data_tarn = array((2.1,2.2,2.3,2.25,2.4,2.61,2.62,3.3,3.4,3.41,3.6,3.8))
data_shodor = array((49,49,45,45,41,38,38,38,40,37,37,34,35,36,35,38,38,32,32,32,37,31,32,31,32,30,30,32,30,30,29,28,29,29,29,30,28,27,29,30,28,27,28,27,27,29,29,29,26,27,25,25,25,25,25,25,25,26,26,27))
data_sat = array((490,499,459,575,575,513,382,525,510,542,368,564,509,530,485,521,495,526,474,500,441,750,495,476,456,440,547,479,501,476,457,444,444,467,482,449,464,501,670,740,590,700,590,450,452,468,472,447,520,506,570,474,532,472,585,466,549,736,654,585,574,621,542,616,547,554,514,592,531,550,507,441,551,450,548,589,549,485,480,545,451,448,487,480,540,470,529,445,460,457,560,495,480,430,644,489,506,660,444,551,583,457,440,470,486,413,470,408,440,596,442,544,528,559,505,450,477,557,446,553,370,533,496,513,403,496,543,533,471,404,439,459,393,470,650,512,445,446,474,449,529,538,450,570,499,375,515,445,571,442,492,456,428,536,515,450,537,490,446,410,526,560,560,540,502,642,590,480,557,468,524,445,479))
simple_data = array((0,5,10))
data = grades

Two of the main problems with histograms are (1) you need to define a bin size (2) you need to decide where the left edge of the bin is.

Histogram bin size

Let's look at the effects of bin size on histograms.

Caveat: the code below is certainly not optimized. Ditto for all of the code in this notebook. I wrote it quickly and at the same time I learned what FuncAnimation does. In order to make this read more easily, I've included most of the code at the end. If you're running this interactively, run the cell at the end now!

Let's start with getHistBinNumAni. What does that do? Given a data set, it'll give us an interactive plot. By dragging the slider around, we can make a histogram with anywhere from 1 bin to some max (default: 20) number of bins. No matter how many bins we have, the actual data is shown in blue dots near the bottom. Here's what it looks like for the grades:

In [7]:
ani = getHistBinNumAni(data)
display_animation(ani, default_mode='once')
Out[7]:


Once Loop Reflect

So, obviously chosing the number of bins makes a huge difference in how we'd interpret the data.

Where do the histogram bins start?

One of the other big problems with histograms, especially for relatively small data sets, is that you have to choose where the left edge of the first bin goes. Do you center the bin around the first group of points? Do you make the left edge match up with the left-most data point? Let's make some plots to see how that can affect things, because it's a bit easier to understand what I'm going on about that way. We'll make a similar animation with getHistBinOffsetAni. As with the previous animation, drag the slider around. This time, we have the same number of bins, but the slider drags around the data relative to the bins (or vice versa, depending on how you think of it).

In [8]:
ani = getHistBinOffsetAni(data)
display_animation(ani, default_mode='once')
Out[8]:


Once Loop Reflect

KDE (Kernel Density Estimation) to the rescue!

Kernel density estimation is my favorite alternative to histograms. Tarn Duong has fantastic KDE explanation, which is well worth reading. The basic idea is that, if you're looking at our simple dataset (simple_data = array((0,5,10)), you might choose to represent each point as a rectangle:

In [9]:
bar(simple_data,ones_like(simple_data)*0.5,width=0.5,facecolor='grey',alpha=0.5)
junk = ylim(0, 2.0)

not so interesting so far, but what do we do when the rectangles get wide enough that they start to overlap? Instead of just letting them run over each other like

In [10]:
bar(simple_data,ones_like(simple_data)*0.5,width=6,facecolor='grey',alpha=0.5)
junk = ylim(0, 2.0)

and instead of coloring the overlap regions darker grey, we add the rectangles together. So, since each of the rectangles has height 0.5 in the above example, the dark grey regions should really have height 1.0. This idea is called "kernel density estimation" (KDE), and the rectangle that we're using is called the "kernel". If we wanted to draw a different shape at each point, we'd do so by specifying a different kernel (perhaps a bell curve, or a triangle).

KDE, rectangular kernel

Now let's try KDE with a rectangular kernel. This time, using getKdeRectAni, you get a slider controls the width of the kernel.

In [11]:
ani = getKdeRectAni(simple_data)
display_animation(ani, default_mode='once')
Out[11]:


Once Loop Reflect

play with the slider, and note what happens when you make it big enough that the rectangles start to overlap. By tuning the width of the rectangles, we can tune how finely or coarsely we're looking at the data. It's not so powerful with three data points, but check it out with the grades from above:

In [12]:
ani = getKdeRectAni(data)
display_animation(ani, default_mode='once')
Out[12]:


Once Loop Reflect

In my view, there's a sweet spot right around 1/8 or 1/9 of the way across the slider where there are three distinct peaks. It looks very much like a trimodal distribution to me. So far, this isn't totally automatic; we have to pick the width of our kernel, but it's obvoius that KDE can give us a much better view of the underlying data than histograms!

KDE, Gaussian kernel

As mentioned above, we can use a different kernel. One of the most common kernels to use is a Gaussian. Using getKdeGaussianAni:

Again, the slider controls kernel width.

In [13]:
ani = getKdeGaussianAni(data)
display_animation(ani, default_mode='once')
Out[13]:


Once Loop Reflect

This gives us a really nice picture of the data. Play around with the slider and see what you think.

Kernel width

Not only does KDE give us a better picture than histograms, but there turn out to be actual answers to the question of "how wide should my kernel be?" You can see, for instance, that making the kernel too narrow doesn't provide much more information than the raw data, while making it too large oversmooths the data, making it mostly look like a single kernel with some bits on the sides.

Daniel Smith has a really nice KDE module that chooses an optimal bandwidth and can be used with SciPy (scipy does have its own KDE module, but I've found Daniel's to be quite robust).

Other data sets

I highly recommend just playing around with other data sets using the above code. I was interested in playing around with income data, so I show how to grab that data from the IRS website below and play around a bit without comment. Enjoy!

Income data

Let's grab the income data from The IRS and make some plots.

In [14]:
import urllib
f = urllib.urlopen("http://www.irs.gov/file_source/pub/irs-soi/09incicsv.csv")
#"State_Code","County_Code","State_Abbrv","County_Name",   "Return_Num","Exmpt_Num","AGI","Wages_Salaries","Dividends","Interest"
irs2009 = loadtxt(f,delimiter=',',skiprows=1,usecols=(4,5,6,7,8,9))
agi2009 = irs2009[:,2]

Now try things like

In [15]:
ani = getHistBinNumAni(agi2009)
display_animation(ani, default_mode='once')
Out[15]:


Once Loop Reflect

Whoops, that's hard to make sense of. Let's use logs

In [16]:
la2009 = log(agi2009)
la2009 = la2009[-isnan(la2009)]
-c:1: RuntimeWarning: invalid value encountered in log

In [17]:
ani = getHistBinNumAni(la2009)
display_animation(ani, default_mode='once')
Out[17]:


Once Loop Reflect
In [18]:
ani = getKdeRectAni(la2009)
display_animation(ani, default_mode='once')
Out[18]:


Once Loop Reflect
In [19]:
ani = getKdeGaussianAni(la2009)
display_animation(ani,default_mode='once')
Out[19]:


Once Loop Reflect

In order to make this read more easily, I've put the bulk of the code below. You'll have to run it before the previous cells.

In [6]:
#!/usr/bin/env python

from numpy import histogram as nphistogram
#from numpy import array, linspace, zeros, ones, ones_like
#import numpy as np
#import matplotlib.pyplot as plt
#from matplotlib.pyplot import figure, hist, plot, ion, axes, title
from JSAnimation.IPython_display import display_animation, anim_to_html

from matplotlib import animation as animation


def getHistBinNumAni(data,totalframes=None,showpts=True):
    #ion()
    if totalframes is None:
        totalframes = min(len(data)-1,100)
    fig = figure()
    ax = fig.gca()

    n, bins, patches = hist(data, totalframes, normed=1, facecolor='green', alpha=0.0)
    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    def animate(i):
        n, bins = nphistogram(data, i+1, normed=False)
        #print n
        ax.set_ylim(0,1.1*n.max())
        for j in range(len(n)):
            rect,h = patches[j],n[j]
            #print h.max()
            x = bins[j]
            w = bins[j+1] - x
            rect.set_height(h)
            rect.set_x(x)
            rect.set_width(w)
            rect.set_alpha(0.75)
        #fig.canvas.draw()
    
    ani = animation.FuncAnimation(fig, animate, totalframes, repeat=False)
    return ani

def getHistBinOffsetAni(data,nbins=20,showpts=True):
    offsets = linspace(-0.5,0.5,50)
    totalframes = len(offsets)
    fig = figure()
    ax = fig.gca()

    n, _bins, patches = hist(data, nbins, normed=1, facecolor='green', alpha=0.0)
    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    # Obnoxious: find max number in a bin ever
    nmax = 1
    for i in range(totalframes):
        dx = (data.max() - data.min())/nbins
        _bins = linspace(data.min() - dx + offsets[i]*dx, data.max()+dx + offsets[i]*dx,len(data)+1)
        n, bins = nphistogram(data, bins=_bins, normed=False)
        nmax = max(nmax,n.max())
                               
    def animate(i):
        dx = (data.max() - data.min())/nbins
        # bins go from min - dx to max + dx, then offset.
        _bins = linspace(data.min() - dx + offsets[i]*dx, data.max()+dx + offsets[i]*dx,nbins)
        n, bins = nphistogram(data, bins = _bins, normed=False)
        ax.set_ylim(0,1.1*nmax)
        #ax.set_xlim(data.min()-dx,data.max()+dx)
        binwidth = bins[1] - bins[0]
        ax.set_xlim(bins[0]-binwidth,bins[-1] + binwidth)

        for j in range(len(n)):
            #continue
            rect,h = patches[j],n[j]
            #print h.max()
            x = bins[j]
            w = bins[j+1] - x
            rect.set_height(h)
            rect.set_x(x)
            rect.set_width(w)
            rect.set_alpha(0.75)
        fig.canvas.draw()    
    ani = animation.FuncAnimation(fig, animate, totalframes, repeat=False)
    return ani
#!/usr/bin/env python

from numpy import sqrt, pi, exp

def getKdeGaussianAni(data,totalframes=100, showpts=True):
    fig = figure()
    
    # Let's say 10000 points for the whole thing
    width = data.max() - data.min()
    left, right = data.min(), data.min() + (width)
    left, right = left - (totalframes/100)*width, right + (totalframes/100)*width
    
    ax = axes(xlim=(left,right),ylim=(-0.1,2))
    line, = ax.plot([], [], lw=2)
    if showpts:
        junk = plot(data,ones_like(data)*0.1,'go')

    
    numpts = 10000
    x = linspace(left,right,numpts)
    
    dx = (right-left)/(numpts-1)
    
    def init():
        line.set_data([], [])
        return line,
    
    def gaussian(x,sigma,mu):
        # Why isn't this defined somewhere?! It must be!
        return (1/sqrt(2*pi*sigma**2)) *  exp(-((x-mu)**2)/(2*sigma**2))
    
    def animate(i):
        y = zeros(10000)
        kernelwidth = .02*width*(i+1)
        kernelpts = int(kernelwidth/dx)
        kernel = gaussian(linspace(-3,3,kernelpts),1,0)
        #kernel = ones(kernelpts)
        for d in data:
            center = d - left
            centerpts = int(center/dx)
            bottom = centerpts - int(kernelpts/2)
            top = centerpts+int(kernelpts/2)
            if top - bottom < kernelpts: top = top + 1
            if top - bottom > kernelpts: top = top - 1
            y[bottom:top] += kernel
        ax.set_xlim(x[where(y>0)[0][0]],x[where(y>0)[0][-1]])
        line.set_data(x,y)
        ax.set_ylim(min(0,y.min()),1.1*y.max())
        #title('ymin %s ymax %s'%(y.min(),y.max()))

    
        #sleep(0.1)
        return line,
    ani = animation.FuncAnimation(fig, animate, init_func=init,
                                  frames=totalframes, repeat=False)
    return ani
#FACTOR ME for rect and gaussian
def getKdeRectAni(data,totalframes=100,showpts=True):
    #ion()
    totalframes = 100
    fig = figure()
    
    # Let's say 10000 points for the whole thing
    width = data.max() - data.min()
    left, right = data.min(), data.min() + (width)
    left, right = left - (totalframes/100)*width, right + (totalframes/100)*width
    
    ax = axes(xlim=(left,right),ylim=(-0.1,2))
    line, = ax.plot([], [], lw=2)
    
    numpts = 10000
    x = linspace(left,right,numpts)
    
    dx = (right-left)/(numpts-1)
    
    def init():
        line.set_data([], [])
        return line,

    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    
    def animate(i):
        y = zeros(10000)
        kernelwidth = .02*width*(i+1)
        kernelpts = int(kernelwidth/dx)
        kernel = ones(kernelpts)
        for d in data:
            center = d - left
            centerpts = int(center/dx)
            bottom = centerpts - int(kernelpts/2)
            top = centerpts+int(kernelpts/2)
            if top - bottom < kernelpts: top = top + 1
            if top - bottom > kernelpts: top = top - 1
            y[bottom:top] += kernel
        line.set_data(x,y)
        ax.set_ylim(0,1.1*y.max())
        ax.set_xlim(x[where(y>0)[0][0]],x[where(y>0)[0][-1]])
    
        #sleep(0.1)
        return line,
    ani = animation.FuncAnimation(fig, animate, init_func=init,
                                  frames=totalframes, repeat=False)
    return ani

And that's it. Cheers!

Comments from old blog

13 Responses to Histograms and Kernel Density Estimation (KDE)

  1. Arindam Paul says:

    Excellent !! One of the best explanations of KDE I have ever seen.
    This post has generated enough interest to read your other blogs. great job.

  2. Nils Wagner says:

    Assume that we have a spatial energy distribution given at discrete points in 3-D, i.e.

    E_i(x_i,y_i,z_i)

    where E_i denotes the energy and x_i,y_i,z_i are the corresponding coordinates.

    Is it possible to extract the local hot spots using scipy ?

    A small example is appreciated.

    Thanks in advance.

  3. domain says:

    It's really a cool and helpful piece of info. I'm happy that you simply shared this helpful information with us.
    Please stay us up to date like this. Thanks for sharing.

  4. Andreas says:

    Thanks for sharing your knowledge and interpretation of kernel density estimation with us. Very enlighting.

  5. gmas says:

    If I try to run your notebook, I get this name error:


    NameError Traceback (most recent call last)
    in ()
    ----> 1 ani = getHistBinNumAni(data)
    2 display_animation(ani, default_mode='once')

    NameError: name 'getHistBinNumAni' is not defined

  6. X says:

    Is there a way to fit data to an exponential distribution such that it maximizes the entropy H(p_i) = - sum p_i*log(p_i) where p_i is the probability of a given event?

  7. Pingback: Histogram to PDF (Edit)

  8. Pingback: Histogram to PDF (Edit)

  9. Ben says:

    Fantastic explanation!
    Best KDE description I've found so far!
    Keep up the good work!

  10. Just wanted to say this website is extremely good. I always want to hear new things about this because I’ve the similar blog during my Country with this subject which means this help´s me a lot. I did so a search around the issue and located a large amount of blogs but nothing beats this. Many thanks for sharing so much inside your blog..

Drum head normal modes, with movies

We’re working our way through Boas in my Mathematical Physics class, and we’ve come to the point in the PDE chapter where every good Physics student figures out what the normal modes of a circular drum head ought to look like. Punchline:

Drum heads

So, we get to the intuitive answer that we’re looking for $$J_n(k_{mn}r)\cos(n\theta)\cos(kvt)$$, where $$k_{mn}$$ is the $$m^{th}$$ zero of $$J_m$$ (we could also look for the $$\sin \sin$$, $$\sin \cos$$ or $$\cos \sin$$ solutions, but they’re not fundamentally different), and Boas has the standard picture of the nodelines to help with visualization. I’m becoming increasingly convinced that students should be competent with at least a little bit of Python. You could argue for Matlab or Mathematica, but I’m leaning heavily towards Python for a number of reasons[*]. The point is that things like “the sum of two specific traveling waves is a standing wave” are just easier to see if you can visualize and animate things, and they pack a bigger punch if you can do that yourself.

matplotlib animation interlude

I used standard matplotlib animations to show some basics with traveling waves, standing waves, Bessel functions, and building up series solutions piece by piece. The animation framework is excellent, because it’s simple to use and a traveling wave looks like this:

#!/usr/bin/env python
from __future__ import division
from numpy import sin, pi, linspace
import matplotlib.pyplot as plt
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
v = -0.01
fig = plt.figure()
ax = plt.axes(xlim=(-4, 4), ylim=(-2, 2))
line, = ax.plot([], [], lw=2, color=’red’)

# initialization function: plot the background of each frame
def init():
    line.set_data([], [])
    return line,

# animation function.  This is called sequentially
x = linspace(-4, 4, 2000)

def animate(t):
    y = sin(2 * pi * (x – v * t))
    line.set_data(x, y)
    return line,

# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=20, blit=True)

plt.show()

I can talk the students through it, and there are really only two interesting lines: the v=-0.01 to set the velocity and the y = sin(2 * pi * (x – v * t)) to do the math. Students can change the sign of the velocity to see it move in the other direction, change the math line to have some series solution, etc. in a straightforward way. We did that with our solutions for plucked and hammered strings (guitar vs. piano). It made for a nice in-class demo, and I’l have the students more involved from the start of the semester next time. There’s code for simple waves in my github repo, and you can find lots of decent animations on the net.

back to drum heads

What you can’t seem to find (or, at least, I can’t Edit: I knew the web had to have a nice demo, and here’s one using Mathematica) is animated versions of the normal modes we want ($$J_n(k_{mn}r)\cos(n\theta)\cos(kvt)$$). The math part is easy, especially because scipy.special has such nice things built in. The main part just involves

from scipy.special import jn, jn_zeros
from numpy import cos

def k(m,n):
    return jn_zeros(n,m)[m-1] # m is 0-indexed here

result = jn(n,k(m,n)*R)*cos(n*theta)*cos(k(m,n)*v*t)

The animation part is a bit trickier because I really want 3D plots in addition to the 2D ones, and the matplotlib 3d canvas doesn’t play nicely with the animation framework. It’s still not particularly tricky. When we plot something, we get an object back. We usually ignore that object. In order to “animate” things, we just keep track of that object and delete it each time through our main loop. I ended up with two versions of the code. The one below has a couple of nice features. It animates a single mode, but you can specify that on the command line. Optionally, you can also specify “cc” for “cos cos”, “cs” for “cos sin” etc.

The funny business with zlim near the end is because matplotlib sets the z limits of the plots automatically, each time you plot something. In our case, it looks much smoother to just start the z axis at (-1,1) instead of watching it stretch throughout the first oscillation.

#!/usr/bin/env python
from __future__ import division
“”"
Based on the wireframe example script
“”"
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
from numpy import sin, cos, arctan, arctan2, array, sqrt, linspace, meshgrid
import scipy
import scipy.special
from scipy.special import jn, jn_zeros
import time, sys, os

v = 1
n, m = 1, 1
if len(sys.argv) >= 3:
    n,m = [int(i) for i in sys.argv[1:3]]
if m == 0:
    sys.exit(“M must be greater than or equal to 1″)

if len(sys.argv) >= 4:
    fns = sys.argv[3]
else:
    fns = ‘cc’
    
f1 = {‘s’:sin,’c':cos}[fns[0]]
f2 = {‘s’:sin,’c':cos}[fns[1]]

def k(m,n):
    return jn_zeros(n,m)[m-1] # m is 0-indexed here

def generate(X, Y, t):
    theta = arctan2(Y,X) # This does arctan(Y/X) but gets the sign right.
    R = sqrt(X**2 + Y**2)
    # We know z = J_n(k*r)*cos(n*theta)*cos(k*v*t)
    # 
    result = jn(n,k(m,n)*R)*f1(n*theta)*f2(k(m,n)*v*t)
    result[R>1] = 0 # we plot points from the square, but physically require this.
    return result

plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)

xs = linspace(-1, 1, 100)
ys = linspace(-1, 1, 100)
X, Y = meshgrid(xs, ys)
Z = generate(X, Y, 0.0)

wframe = None
tstart = time.time()
for t in linspace(0, 10, 400):
    oldcol = wframe
    Z = generate(X, Y, t)
    #wframe = ax.plot_wireframe(X, Y, Z, rstride=2, cstride=2)
    wframe = ax.plot_surface(X, Y, Z, rstride=4, cstride=4, alpha=0.3)

    # Remove old line collection before drawing
    if oldcol is not None:
        ax.collections.remove(oldcol)
    if n == 0:
        ax.set_zlim(-1,1)
    else:
        ax.set_zlim(-0.5,0.5)
    plt.draw()

The individual frames look something like

A frame from k11

I also made a beefier version (you can find it in my github repo) that plots several of the modes in a grid. The different columns are m=1,2,3 and rows are n=1,2. The script lets you plot as many or as few of these as you’d like. It shows a 2D version next to the 3D version, and the nodelines are shown in black on the 2D version.

I also followed The Glowing Python’s idea of writing out a bunch of png files for each frame and making a movie with ffmpeg:

ffmpeg -q:a 5 -r 5 -b:v 19200 -i img%04d.png movie.mp4

Here’s the movie:

and you can download a slightly higher quality version here. You can download a slightly higher quality of the movie at the top of this page (2 values each of m and n) here.

[*] Among other things:

  • Python is free, which is important at a small school
  • I want this to show up in a computational modeling class, and I want that class to draw in CS, biology, chemistry, biochemistry, math, econ and geology students. I think Python is an easier cross-department sell
  • The IPython Notebook is fantastic these days